(57) ABSTRACT

A method is provided for generating a high dimensional density model within an acoustic model for one of a speech and a speaker recognition system. Acoustic data obtained from at least one speaker is transformed into high dimensional feature vectors. The density model is formed to model the feature vectors by a mixture of compound Gaussians with a linear transform, wherein each compound Gaussian is associated with a compound Gaussian prior and models each coordinate of each component of the density model independently by a univariate Gaussian mixture comprising a univariate Gaussian prior, variance, and mean. An iterative expectation maximization (EM) method is applied to the feature vectors. The EM method includes the step of computing an auxiliary function Q of the EM method.

19 Claims, 6 Drawing Sheets 
HIGH DIMENSIONAL ACOUSTIC MODELING VIA MIXTURES OF COMPOUND GAUSSIANS WITH LINEAR TRANSFORMS

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a non-provisional application claiming the benefit of provisional application Ser. No. 60/180,306, filed on Feb. 4, 2000, the disclosure of which is incorporated by reference herein.

This application is related to the application entitled "High Dimensional Data Mining and Visualization via Gaussianization", which is commonly assigned and concurrently filed herewith, and the disclosure of which is incorporated herein by reference.

BACKGROUND

1. Technical Field

The present invention generally relates to high dimensional data and, in particular, to methods for mining and visualizing high dimensional data through Gaussianization.

2. Background Description

Density Estimation in high dimensions is very challenging due to the so-called "curse of dimensionality". That is, in high dimensions, data samples are often sparsely distributed. Thus, density estimation requires very large neighborhoods to achieve sufficient counts. However, such large neighborhoods could cause neighborhood-based techniques, such as kernel methods and nearest neighbor methods, to be highly biased.

The exploratory projection pursuit density estimation algorithm (hereinafter also referred to as the "exploratory projection pursuit") attempts to overcome the curse of dimensionality by constructing high dimensional densities via a sequence of univariate density estimates. At each iteration, one finds the most non-Gaussian projection of the current data, and transforms that direction to univariate Gaussian. The exploratory projection pursuit is described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, No. 397, pp. 249-66, 1987.

Recently, independent component analysis has attracted a considerable amount of attention. Independent component analysis attempts to recover the unobserved independent sources from linearly mixed observations. This seemingly difficult problem can be solved by an information maximization approach that utilizes only the independence assumption on the sources. Independent component analysis can be applied for source recovery in digital communication and in the "cocktail party" problem. A review of the current status of independent component analysis is described by Bell et al., in "A Unifying Information-Theoretic Framework for Independent Component Analysis", International Journal on Mathematics and Computer Modeling, 1999. Independent component analysis has been posed as a parametric probabilistic model, and a maximum likelihood EM algorithm has been derived, by H. Attias, in "Independent Factor Analysis", Neural Computation, Vol. 11, pp. 803-51, May 1999.

Parametric density models, in particular Gaussian mixture density models, are the most widely applied models in large scale high dimensional density estimation because they offer decent performance with a relatively small number of parameters. In fact, to limit the number of parameters in large tasks such as automatic speech recognition, one assumes only mixtures of Gaussians with diagonal covariances. There are standard EM algorithms to estimate the mixture coefficients and the Gaussian means and covariance. However, in real applications, these parametric assumptions are often violated, and the resulting parametric density estimates can be highly biased. For example, mixtures of diagonal Gaussians

$$p(x) = \sum_{i=1}^{I} \pi_i \, G(x, \mu_i, \Sigma_i)$$

roughly assume that the data is clustered, and within each cluster the dimensions are independent and Gaussian distributed. However, in practice, the dimensions are often correlated within each cluster. This leads to the need for modeling the covariance of each mixture component. The following "semi-tied" covariance has been proposed:

$$\Sigma_i = A \Lambda_i A^T$$

where A is shared and for each component, $\Lambda_i$ is diagonal. Semi-tied covariance is described by: M. J. F. Gales, in "Semi-tied Covariance Matrices for Hidden Markov Models", IEEE Transactions Speech and Audio Processing, Vol. 7, pp. 272-81, May 1999; and R. A. Gopinath, in "Constrained Maximum Likelihood Modeling with Gaussian Distributions", Proc. of DARPA Speech Recognition Workshop, February 8-11, Lansdowne, Va., 1998. Semi-tied covariance has been reported in the immediately preceding two articles to significantly improve the performance of large vocabulary continuous speech recognition systems. It should be appreciated that a compound Gaussian is no longer a diagonal Gaussian.

Accordingly, there is a need for a method that transforms high dimensional data into a standard Gaussian distribution which is computationally efficient.

SUMMARY OF THE INVENTION

The present invention is directed to high dimensional acoustic modeling via mixtures of compound Gaussians with linear transforms. In addition to providing a novel density model within an acoustic model, the present invention also provides an iterative expectation maximization (EM) method which estimates the parameters of the mixtures of the density model as well as of the linear transform.

According to a first aspect of the invention, a method is provided for generating a high dimensional density model within an acoustic model for one of a speech and a speaker recognition system. The density model has a plurality of components, each component having a plurality of coordinates corresponding to a feature space. The method includes the step of transforming acoustic data obtained from at least one speaker into high dimensional feature vectors. The density model is formed to model the feature vectors by a mixture of compound Gaussians with a linear transform. Each compound Gaussian is associated with a compound Gaussian prior and models each of the coordinates of each of the components of the density model independently by a univariate Gaussian mixture including a univariate Gaussian prior, variance, and mean.

According to a second aspect of the invention, the method further includes the step of applying an iterative expectation maximization (EM) method to the feature vectors, to estimate the linear transform, the compound Gaussian priors, and the univariate Gaussian priors, variances, and means.

According to a third aspect of the invention, the EM
method includes the step of computing an auxiliary function 
Q of the EM method. The compound Gaussian priors and the 
univariate Gaussian priors are respectively updated, to maxi- 
mize the auxiliary function Q. The univariate Gaussian 
variances, the linear transform, and the univariate Gaussian 
means are respectively updated to maximize the auxiliary 
function Q, the linear transform being updated row by row. 
The second updating step is repeated, until the auxiliary 
function Q converges to a local maximum. The computing 
step and the second updating step are repeated, until a log 
likelihood of the feature vectors converges to a local maxi- 
mum. 

According to a fourth aspect of the invention, the method 
further includes the step of updating the density model to 
model the feature vectors by the mixture of compound 
Gaussians with the updated linear transform. Each of the 
compound Gaussians is associated with one of the updated 
compound Gaussian priors and models each of the coordi- 
nates of each of the components independently by the 
univariate Gaussian mixtures including the updated univari- 
ate Gaussian priors, variances, and means. 

According to a fifth aspect of the invention, the linear 
transform is fixed, when the univariate Gaussian variances 
are updated. 

According to a sixth aspect of the invention, the univariate 
Gaussian variances are fixed, when the linear transform is 
updated. 

According to a seventh aspect of the invention, the linear 
transform is fixed, when the univariate Gaussian means are 
updated. 
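
The update schedule set out in the third and the fifth through seventh aspects amounts to two nested loops. The following is a minimal sketch of that control flow only, not the patented implementation: the `model` interface (`compute_statistics`, `update_priors`, `update_variances`, `update_transform_rows`, `update_means`, `q_value`, `log_likelihood`), the tolerance `tol`, the iteration caps, and the `TraceModel` stub are all hypothetical names introduced here for illustration.

```python
import math


def run_em(model, data, tol=1e-6, max_outer=50, max_inner=50):
    """Nested EM schedule; every `model` method is a hypothetical hook."""
    prev_ll = -math.inf
    for outer in range(max_outer):
        stats = model.compute_statistics(data)   # compute auxiliary function Q
        if outer == 0:
            model.update_priors(stats)           # priors are updated once only
        prev_q = -math.inf
        for _ in range(max_inner):
            model.update_variances(stats)        # linear transform held fixed
            model.update_transform_rows(stats)   # variances held fixed, row by row
            model.update_means(stats)            # linear transform held fixed
            q = model.q_value(stats)
            if q - prev_q < tol:                 # Q has stopped improving
                break
            prev_q = q
        ll = model.log_likelihood(data)
        if ll - prev_ll < tol:                   # log likelihood has stopped improving
            break
        prev_ll = ll
    return model


class TraceModel:
    """Stub that records call order and returns saturating Q and likelihood."""

    def __init__(self):
        self.calls = []
        self.inner_steps = 0
        self.outer_steps = 0

    def compute_statistics(self, data):
        self.calls.append("Q")
        return None

    def update_priors(self, stats):
        self.calls.append("priors")

    def update_variances(self, stats):
        self.calls.append("variances")

    def update_transform_rows(self, stats):
        self.calls.append("transform")

    def update_means(self, stats):
        self.calls.append("means")
        self.inner_steps += 1

    def q_value(self, stats):
        return -1.0 / self.inner_steps   # increases, with shrinking increments

    def log_likelihood(self, data):
        self.outer_steps += 1
        return -1.0 / self.outer_steps


trace = run_em(TraceModel(), data=[], tol=0.06)
```

The stub makes the schedule visible in `trace.calls`: one prior update at the start, then repeated variance, transform, and mean updates inside each recomputation of Q.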

These and other aspects, features and advantages of the 
present invention will become apparent from the following 
detailed description of preferred embodiments, which is to 
be read in connection with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a flow diagram illustrating a method for mining 
high dimensional data according to an illustrative embodi- 
ment of the present invention; 

FIG. 2 is a flow diagram illustrating a method for visu- 
alizing high dimensional data according to an illustrative 
embodiment of the present invention; 

FIG. 3 is a flow diagram illustrating steps 112 and 212 of 
FIGS. 1 and 2, respectively, in further detail according to an 
illustrative embodiment of the present invention; 

FIG. 4 is a flow diagram of a method for generating a high 
dimensional density model within an acoustic model for a 
speech and/or a speaker recognition system, according to an 
illustrative embodiment of the invention; 

FIG. 5 is a flow diagram of a method for generating a high 
dimensional density model within an acoustic model for a 
speech and/or a speaker recognition system, according to an 
illustrative embodiment of the invention; and 

FIG. 6 is a block diagram of a computer processing 
system 600 to which the present invention may be applied 
according to an embodiment thereof. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

The present invention is directed to high dimensional acoustic modeling via mixtures of compound Gaussians with linear transforms. In addition to providing a novel density model within an acoustic model for a speech and/or speaker recognition system, the present invention also provides an iterative expectation maximization (EM) method which estimates the parameters of the mixtures of the density model as well as of the linear transform.

A general description of the present invention will now be 
given with respect to FIGS. 1-5 to introduce the reader to the 
concepts of the invention. Subsequently, more detailed 
descriptions of various aspects of the invention will be 
provided. 

FIG. 1 is a flow diagram illustrating a method for mining 
high dimensional data according to an illustrative embodi- 
ment of the present invention. The method of FIG. 1 is an 
implementation of the iterative Gaussianization method 
mentioned above and described in further detail below. 

The log likelihood of the high dimensional data is computed (step 110). The high dimensional data is linearly transformed into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data (step 112). Each of the coordinates is marginally Gaussianized, said Gaussianization being characterized by univariate Gaussian means, priors, and variances (step 114).

It is then determined whether the coordinates converge to 
a standard Gaussian distribution (step 116). If not, then the 
method returns to step 112. As is evident, the transforming 
and Gaussianizing steps (112 and 114, respectively) are 
iteratively repeated until the coordinates converge to a 
standard Gaussian distribution, as determined at step 116. 

If the coordinates do converge to a standard Gaussian 
distribution, then the coordinates of all iterations are 
arranged hierarchically to facilitate data mining (step 118). 
The arranged coordinates are then mined (step 120). 

FIG. 2 is a flow diagram illustrating a method for visu- 
alizing high dimensional data according to an illustrative 
embodiment of the present invention. The method of FIG. 2 
is also an implementation of the iterative Gaussianization 
method mentioned above and described in further detail 
below. 

The log likelihood of the high dimensional data is computed (step 210). The high dimensional data is linearly transformed into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data (step 212). Each of the coordinates is marginally Gaussianized, said Gaussianization being characterized by univariate Gaussian means, priors, and variances (step 214).

It is then determined whether the coordinates converge to 
a standard Gaussian distribution (step 216). If not, then the 
method returns to step 212. As is evident, the transforming 
and Gaussianizing steps (212 and 214, respectively) are 
iteratively repeated until the coordinates converge to a 
standard Gaussian distribution, as determined at step 216. 

If the coordinates do converge to a standard Gaussian 
distribution, then the coordinates of all iterations are 
arranged hierarchically into high dimensional data sets to 
facilitate data visualization (step 218). The high dimensional 
data sets are then visualized (step 220). 

FIG. 3 is a flow diagram illustrating steps 112 and 212 of 
FIGS. 1 and 2, respectively, in further detail according to an 
illustrative embodiment of the present invention. It is to be 
appreciated that the method of FIG. 3 is an implementation 
of the iterative expectation maximization (EM) method of 
the invention mentioned above and described in further 
detail below. 

The auxiliary function Q is computed from the log likelihood of the high dimensional data (step 310). It is then determined whether this is the first iteration of the EM method of FIG. 3 (step 312). If so, then the method proceeds to step 314. Otherwise, the method proceeds to step 316 (thereby skipping step 314).

At step 314, the univariate Gaussian priors are updated, to maximize the auxiliary function Q. At step 316, the univariate Gaussian variances are updated, while maintaining the linear transform fixed, to maximize the auxiliary function Q. The linear transform is updated row by row, while maintaining the univariate Gaussian variances fixed, to maximize the auxiliary function Q (step 318). The univariate Gaussian means are updated, while maintaining the linear transform fixed, to maximize the auxiliary function Q (step 320).

It is then determined whether the auxiliary function Q converges to a local maximum (step 322). If not, then the method returns to step 316. As is evident, the steps of updating the univariate Gaussian variances, the linear transform, and the univariate Gaussian means (316, 318, and 320, respectively) are iteratively repeated until the auxiliary function Q converges to a local maximum, as determined at step 322.

If the auxiliary function Q converges to a local maximum, then it is determined whether the log likelihood of the high dimensional data converges to a local maximum (step 324). If not, the method returns to step 310. As is evident, the computing step (310) and all of the updating steps other than the step of updating the univariate Gaussian priors (316-320) are iteratively repeated until the auxiliary function Q converges to a local maximum (as determined at step 322) and the log likelihood of the high dimensional data converges to a local maximum (as determined at step 324). If the log likelihood of the high dimensional data converges to a local maximum, then the EM method is terminated.

FIG. 4 is a flow diagram of a method for generating a high dimensional density model within an acoustic model for a speech and/or a speaker recognition system, according to an illustrative embodiment of the invention. The density model has a plurality of components, each component having a plurality of coordinates corresponding to a feature space. It is to be noted that the method of FIG. 4 is directed to forming a density model having a novel structure as further described below.

Acoustic data obtained from at least one speaker is transformed into high dimensional feature vectors (step 410). The density model is formed to model the high dimensional feature vectors by a mixture of compound Gaussians with a linear transform (step 412). Each of the compound Gaussians of the mixture is associated with a compound Gaussian prior and models each coordinate of each component of the density model independently by univariate Gaussian mixtures. Each univariate Gaussian mixture comprises a univariate Gaussian prior, variance, and mean.

FIG. 5 is a flow diagram of a method for generating a high dimensional density model within an acoustic model for a speech and/or a speaker recognition system, according to another illustrative embodiment of the present invention. The density model has a plurality of components, each component having a plurality of coordinates corresponding to a feature space. It is to be noted that the method of FIG. 5 is directed to forming a density model having a novel structure as per FIG. 4 and further to estimating the parameters of the density model. The method of FIG. 5 is an implementation of the iterative expectation maximization method of the invention mentioned above and described in further detail below.

Acoustic data obtained from at least one speaker is transformed into high dimensional feature vectors (step 510). The density model is formed to model the feature vectors by a mixture of compound Gaussians with a linear transform (step 512). Each compound Gaussian is associated with a compound Gaussian prior. Also, each compound Gaussian models each coordinate of each component of the density model independently by a univariate Gaussian mixture. Each univariate Gaussian mixture comprises a univariate Gaussian prior, variance, and mean.

An iterative expectation maximization (EM) method is applied to the feature vectors (step 514). The EM method includes steps 516-528. At step 516, an auxiliary function of the EM method is computed from the log likelihood of the high dimensional feature vectors. It is then determined whether this is the first iteration of the EM method of FIG. 5 (step 518). If so, then the method proceeds to step 520. Otherwise, the method proceeds to step 524 (thereby skipping steps 520 and 522).

At step 520, the compound Gaussian priors are updated, to maximize the auxiliary function Q. At step 522, the univariate Gaussian priors are updated, to maximize the auxiliary function Q.

At step 524, the univariate Gaussian variances are updated, while maintaining the linear transform fixed, to maximize the auxiliary function Q. The linear transform is updated row by row, while maintaining the univariate Gaussian variances fixed, to maximize the auxiliary function Q (step 526). The univariate Gaussian means are updated, while maintaining the linear transform fixed, to maximize the auxiliary function Q (step 528).

It is then determined whether the auxiliary function Q converges to a local maximum (step 530). If not, then the method returns to step 524. As is evident, the steps of updating the univariate Gaussian variances, the linear transform, and the univariate Gaussian means (524, 526, and 528, respectively) are iteratively repeated until the auxiliary function Q converges to a local maximum, as determined at step 530.

If the auxiliary function Q converges to a local maximum, then it is determined whether the log likelihood of the high dimensional feature vectors converges to a local maximum (step 532). If not, the method returns to step 516. As is evident, the computing step (516) and all of the updating steps other than the steps of updating the compound and univariate Gaussian priors (524-528) are iteratively repeated until the auxiliary function Q converges to a local maximum (as determined at step 532).

If the log likelihood of the high dimensional feature vectors converges to a local maximum, then the density model is updated to model the feature vectors by the mixture of compound Gaussians with the updated linear transform (step 534). Each compound Gaussian is associated with an updated compound Gaussian prior and models each coordinate of each component independently by the univariate Gaussian mixtures. Each univariate Gaussian mixture comprises an updated univariate Gaussian prior, variance, and mean.

More detailed descriptions of various aspects of the invention will now be provided. The present invention provides a method that transforms multi-dimensional random variables into standard Gaussian random variables. For a random variable $X \in R^n$, the Gaussianization transform T is defined as an invertible transformation of X such that

$$T(X) \sim N(0, I).$$
The Gaussianization transform corresponds to a density estimation

$$p(X) = N(T(X); 0, I) \left| \det \frac{\partial T}{\partial X} \right| .$$

The Gaussianization method of the invention is iterative and converges. Each iteration is parameterized by univariate parametric density estimates and can be efficiently solved by an expectation maximization (EM) algorithm. Since the Gaussianization method of the invention involves univariate density estimates only, it has the potential to overcome the curse of dimensionality. Some of the features of the Gaussianization method of the invention include the following:

(1) The Gaussianization method of the invention can be viewed as a parametric generalization of Friedman's exploratory projection pursuit and can be easily accommodated to perform arbitrary d-dimensional projection pursuit.

(2) Each of the iterations of the Gaussianization method of the invention can be viewed as solving a problem of independent component analysis by maximum likelihood. Thus, according to an embodiment of the invention, an EM method is provided that is computationally more attractive than the noiseless independent factor analysis algorithm described by H. Attias, in "Independent Factor Analysis", Neural Computation, Vol. 11, pp. 803-51, May 1999.

(3) The probabilistic models of the invention can be viewed as a generalization of the mixtures of diagonal Gaussians with explicit description of the non-Gaussianity of each dimension and the dependency among the dimensions.

(4) The Gaussianization method of the present invention induces independent structures which can be arranged hierarchically as a tree.

In the problem of classification, we are often interested in transforming the original feature to obtain more discriminative features. To this end, most of the focus of the prior art has been on linear transforms such as: linear discriminant analysis (LDA); maximum likelihood linear transform (MLLT); and semi-tied covariances. In contrast, the present invention provides nonlinear feature extraction based on Gaussianization. Efficient EM type methods are provided herein to estimate such nonlinear transforms.

A description of Gaussianization will now be given. For a random variable $X \in R^n$, we define its Gaussianization transform to be an invertible and differentiable transform T(X) such that the transformed variable T(X) follows the standard Gaussian distribution:

$$T(X) \sim N(0, I).$$

Naturally, the following questions arise. Does a Gaussianization transform exist? If so, is it unique? Moreover, how can one construct a Gaussianization transformation from samples of the random variable X? As is shown below with minor regularity assumptions on the probability density of X, a Gaussianization transform exists and the transform is not unique. The constructive algorithm, however, is not amenable to being used in practice on samples of X. For estimation of the Gaussianization transform from sample data (i.e., given i.i.d. observations $\{X_i : 1 \le i \le L\}$ and regularity conditions on the probability density function of X viz., that it is strictly positive and continuously differentiable), an iterative Gaussianization method is provided wherein, at each iteration, a maximum-likelihood parameter estimation problem is solved. Let $\phi(x)$ be the probability density function of the standard normal

$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right)$$

and let $\Phi(x)$ be the cumulative distribution function (CDF) of the standard normal

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right) dy .$$

A description of one dimensional Gaussianization will now be given. Let us first consider the univariate case: $X \in R^1$. Let F(X) be the cumulative distribution function of X:

$$F(x) = P(X \le x).$$

It can be easily verified that

$$\Phi^{-1}(F(X)) \sim N(0, 1). \quad (1)$$

In practice, the CDF F(X) is not available; it has to be estimated from the training data. According to an embodiment of the invention, it is approximated by Gaussian mixture models

$$p(x) = \sum_{i=1}^{I} \pi_i \, G(x, \mu_i, \sigma_i^2)$$

i.e., we assume the CDF

$$F(x) = \sum_{i=1}^{I} \pi_i \, \Phi\left( \frac{x - \mu_i}{\sigma_i} \right).$$

Therefore, we parameterize the Gaussianization transform as

$$T(x) = \Phi^{-1}\left( \sum_{i=1}^{I} \pi_i \, \Phi\left( \frac{x - \mu_i}{\sigma_i} \right) \right)$$

where the parameters $\{\pi_i, \mu_i, \sigma_i\}$ can be estimated via maximum likelihood using the standard EM algorithm.

A description of the existence of high dimensional Gaussianization will now be given. For any random variable $X \in R^n$, the Gaussianization transform can be constructed theoretically as a sequence of n one dimensional Gaussianizations. Let $X^{(0)} = X$. Let $p^{(0)}(x_1^{(0)}, \dots, x_n^{(0)})$ be the probability density function. First, we Gaussianize the first coordinate $x_1^{(0)}$:

$$x_1^{(1)} = T_1(x_1^{(0)}) = \Phi^{-1}\left( F_{x_1^{(0)}}(x_1^{(0)}) \right)$$

where $F_{x_1^{(0)}}$ is the marginal CDF of $x_1^{(0)}$:

$$F_{x_1^{(0)}}(t) = P(x_1^{(0)} \le t).$$

The remaining coordinates are left unchanged:

$$x_d^{(1)} = x_d^{(0)}, \quad d = 2, \dots, n.$$

Let $p^{(1)}(x_1^{(1)}, \dots, x_n^{(1)})$ be the density of the transformed variable $X^{(1)}$. Clearly,
$$p^{(1)}(x_1^{(1)}, \dots, x_n^{(1)}) = \phi(x_1^{(1)}) \, p^{(1)}(x_2^{(1)}, \dots, x_n^{(1)} \mid x_1^{(1)}).$$

We can then Gaussianize the conditional distribution $p^{(1)}(x_2^{(1)} \mid x_1^{(1)})$:

$$x_2^{(2)} = T_2(x_2^{(1)}) = \Phi^{-1}\left( F_{x_2^{(1)} \mid x_1^{(1)}}(x_2^{(1)}) \right)$$

where $F_{x_2^{(1)} \mid x_1^{(1)}}$ is the CDF of the conditional density $p^{(1)}(x_2^{(1)} \mid x_1^{(1)})$. The remaining coordinates are left unchanged:

$$x_d^{(2)} = x_d^{(1)}, \quad d = 1, 3, \dots, n.$$

Let $p^{(2)}(x_1^{(2)}, \dots, x_n^{(2)})$ be the density of the transformed variable $X^{(2)}$:

$$p^{(2)}(x_1^{(2)}, \dots, x_n^{(2)}) = \phi(x_1^{(2)}) \, \phi(x_2^{(2)}) \, p^{(2)}(x_3^{(2)}, \dots, x_n^{(2)} \mid x_1^{(2)}, x_2^{(2)}).$$

Then, we can Gaussianize the conditional density $p^{(2)}(x_3^{(2)} \mid x_1^{(2)}, x_2^{(2)})$ and so on. After n steps, we obtain the transformed variable $X^{(n)}$ which is standard Gaussian:

$$p^{(n)}(x_1^{(n)}, \dots, x_n^{(n)}) = \phi(x_1^{(n)}) \cdots \phi(x_n^{(n)}).$$

The above construction is not practical (i.e., we cannot apply it easily when we have sample data from the original random variable X) since at the (k+1)-th step, it requires the conditional density $p^{(k)}(x_{k+1}^{(k)} \mid x_1^{(k)}, \dots, x_k^{(k)})$ for all possible $x_1^{(k)}, \dots, x_k^{(k)}$, which is extremely difficult given finite sample points. It is clear that the Gaussianization transform is not unique since the construction above could have used any other ordering of the coordinates. Advantageously, one embodiment of the invention provides a novel, iterative Gaussianization method that is practical and converges; proof of the former is provided below.
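
The construction is easy to carry out only when the conditional densities are known in closed form. As a toy illustration (an assumed example, not taken from the source), consider a zero-mean, unit-variance bivariate Gaussian with correlation `rho`. The first coordinate is already standard Gaussian, and the conditional law of the second coordinate given the first is N(rho·x1, 1 − rho²), so the two-step construction reduces to standardizing a residual. The function names and the sample-based check below are hypothetical.

```python
import math
import random


def gaussianize_bivariate(x1, x2, rho):
    """Two-step conditional Gaussianization for a standard bivariate Gaussian.

    Step 1: x1 is already N(0, 1), so it is left as is.
    Step 2: x2 | x1 ~ N(rho * x1, 1 - rho**2), so Phi^{-1}(F(x2 | x1)) is
            the standardized residual.
    """
    t1 = x1
    t2 = (x2 - rho * x1) / math.sqrt(1.0 - rho * rho)
    return t1, t2


def sample_correlated(n, rho, seed=0):
    """Draw n points from the bivariate Gaussian with correlation rho."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pts.append((u, rho * u + math.sqrt(1.0 - rho * rho) * v))
    return pts


def correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


rho = 0.8
data = sample_correlated(5000, rho)
out = [gaussianize_bivariate(x1, x2, rho) for x1, x2 in data]
```

With real data the conditional law is unknown and would have to be estimated for every value of the conditioning coordinates, which is the difficulty noted above.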

A description of Gaussianization with independent component analysis assumption will now be given. Let $X = (x_1, \dots, x_n)$ be the high dimensional random variable to be Gaussianized. If we assume that the individual dimensions are independent, i.e.,

$$p(x_1, \dots, x_n) = p(x_1) \cdots p(x_n),$$

we can then simply Gaussianize each dimension by the univariate Gaussianization and obtain the global Gaussianization.
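
Under the independence assumption, global Gaussianization is the univariate transform applied to each coordinate separately. The following is a minimal sketch, assuming the mixture parameters of each coordinate (priors, means, standard deviations) have already been estimated; the function names, the clamping constant `eps`, and the example parameter values are hypothetical and not from the source.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def mixture_cdf(x, priors, means, sigmas):
    """F(x) = sum_i pi_i * Phi((x - mu_i) / sigma_i)."""
    return sum(p * STD_NORMAL.cdf((x - m) / s)
               for p, m, s in zip(priors, means, sigmas))


def gaussianize_1d(x, priors, means, sigmas, eps=1e-12):
    """T(x) = Phi^{-1}(F(x)), with F clamped away from 0 and 1."""
    u = min(max(mixture_cdf(x, priors, means, sigmas), eps), 1.0 - eps)
    return STD_NORMAL.inv_cdf(u)


def gaussianize_independent(point, params):
    """Apply the univariate transform coordinate by coordinate."""
    return [gaussianize_1d(x, *p) for x, p in zip(point, params)]


# One (priors, means, sigmas) triple per coordinate (illustrative values).
params = [
    ([1.0], [2.0], [3.0]),                  # single Gaussian: T(x) = (x - 2) / 3
    ([0.5, 0.5], [-1.0, 1.0], [0.5, 0.5]),  # symmetric two-component mixture
]
t = gaussianize_independent([5.0, 0.0], params)
```

The clamp is only a numerical guard: the inverse CDF is undefined at exactly 0 or 1, which floating point rounding can produce far in the tails.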

However, the above independence assumption is rarely valid in practice. Thus, we can relax the assumption by using the following independent component analysis assumption: assume that there exists a linear transform $A_{n \times n}$ such that the transformed variable

$$Y = (y_1, \dots, y_n) = AX$$

has independent components:

$$p(y_1, \dots, y_n) = p(y_1) \cdots p(y_n).$$

Therefore, we can first find the linear transformation A, and then Gaussianize each individual dimension of Y via univariate Gaussianization. The linear transform A can be recovered by independent component analysis, as described by Bell et al., in "An Information-Maximization Approach to Blind Separation and Blind Deconvolution", Neural Computation, Vol. 7, pp. 1004-34, November 1999.

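As in the univariate case, we model each dimension of Y by a mixture of univariate Gaussians

$$p(y_d) = \sum_{i=1}^{I_d} \pi_{d,i} \, G(y_d, \mu_{d,i}, \sigma_{d,i}^2)$$

and parametrize the Gaussianization transform as

$$t_d = \Phi^{-1}\left( \sum_{i=1}^{I_d} \pi_{d,i} \, \Phi\left( \frac{y_d - \mu_{d,i}}{\sigma_{d,i}} \right) \right).$$

The parameters $\theta = (A, \pi_{d,i}, \mu_{d,i}, \sigma_{d,i})$ can be efficiently estimated by maximum likelihood via EM. Let $\{x_n \in R^D : 1 \le n \le N\}$ be the training set. Let $\{y_n = A x_n : 1 \le n \le N\}$ be the transformed data. Let $\{y_n \in R^D, x_n \in R^D, z_n \in N^D : 1 \le n \le N\}$ be the complete data, where $z_n = (z_{n,1}, \dots, z_{n,D})$ indicates the index of the Gaussian component along each dimension. It is clear that

$$z_{n,d} \sim \mathrm{Multinomial}(1, (\pi_{d,1}, \dots, \pi_{d,I_d}))$$

$$y_{n,1}, \dots, y_{n,D} \mid z_n \ \text{are independent.}$$

We convert $z_{n,d}$ into binary variables $\{(z_{n,d,1}, \dots, z_{n,d,I_d})\}$, where $z_{n,d,i} = 1$ if $z_{n,d} = i$ and $z_{n,d,i} = 0$ otherwise. Therefore,

$$p(x_n, z_n \mid \theta) = |\det A| \prod_{d=1}^{D} \prod_{i=1}^{I_d} \left[ \pi_{d,i} \, G(y_{n,d}, \mu_{d,i}, \sigma_{d,i}^2) \right]^{z_{n,d,i}}.$$

We obtain the following complete log likelihood

$$L(x, z, \theta) = N \log|\det A| + \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} z_{n,d,i} \left[ \log \pi_{d,i} - \frac{1}{2} \log(2\pi\sigma_{d,i}^2) - \frac{(y_{n,d} - \mu_{d,i})^2}{2\sigma_{d,i}^2} \right]$$

where $\theta = (A, \pi, \mu, \sigma)$.

In the E-step, we compute the auxiliary function

$$Q(\theta, \hat{\theta}) = E\left( L(x, z, \theta) \mid x, \hat{\theta} \right)$$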
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11 

•d 



12 



In the M-step, we maximize the auxiliary function Q to update the parameters $\theta$ as

$$\theta = \arg\max_{\theta'} Q(\theta', \theta).$$

From the first order conditions on $(\pi, \mu, \sigma)$, we have

$$\pi_{d,i} = \frac{\sum_{n=1}^{N} \omega_{n,d,i}}{\sum_{i'=1}^{I_d} \sum_{n=1}^{N} \omega_{n,d,i'}}$$

$$\mu_{d,i} = \frac{\sum_{n=1}^{N} \omega_{n,d,i}\, y_{n,d}}{\sum_{n=1}^{N} \omega_{n,d,i}} \qquad (3)$$

$$\sigma_{d,i}^2 = \frac{\sum_{n=1}^{N} \omega_{n,d,i}\,(y_{n,d} - \mu_{d,i})^2}{\sum_{n=1}^{N} \omega_{n,d,i}}$$

where $a_d$ is the d-th row of the matrix A, so that $y_{n,d} = a_d x_n$.

Let $e_d$ be the column vector which is 1 at the position d:

$$e_d = (0, \ldots, 0, 1, 0, \ldots, 0)^T.$$

We obtain the gradient of the auxiliary function with respect to A:

$$\frac{\partial Q}{\partial A} = N (A^T)^{-1} - \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} \omega_{n,d,i}\, \frac{y_{n,d} - \mu_{d,i}}{\sigma_{d,i}^2}\, e_d x_n^T. \qquad (4)$$

Notice that this gradient is nonlinear in A. Therefore, solving the first order equations is nontrivial and it requires iterative techniques to optimize the auxiliary function Q.

In the prior art, gradient descent was used to optimize Q, as described by H. Attias, in "Independent Factor Analysis", Neural Computation, Vol. 11, pp. 803-51, May 1999. According to the prior art, the following steps were performed at each iteration:

(1) Fix A and compute the Gaussian mixture parameters $(\pi, \mu, \sigma)$ by (3).

(2) Fix the Gaussian mixture parameters $(\pi, \mu, \sigma)$ and update A via gradient descent using the natural gradient:

$$A = A + \eta\, \frac{\partial Q}{\partial A}\, A^T A \qquad (5)$$

where $\eta > 0$ determines the learning rate. The natural gradient is described by S. Amari, in "Natural Gradient Works Efficiently in Learning", Neural Computation, Vol. 10, pp. 251-76, 1998.

According to an embodiment of the present invention, an iterative gradient descent method will now be derived which does not involve the nuisance of determining the stepsize parameter $\eta$. At each iteration of the iterative gradient descent method of the invention, for each of the rows of the matrix A:

(1) Fix A and compute the Gaussian mixture parameters $(\pi, \mu, \sigma)$ by (3).

(2) Update each row of A with all the other rows of A and the Gaussian mixture parameters $(\pi, \mu, \sigma)$ fixed. Let $a_j$ be the j-th row; let $c_j = (c_{j,1}, \ldots, c_{j,D})$ be the co-factors of A associated with the j-th row. Note that $a_j$ and $c_j$ are both row vectors. Then

$$Q = N \log|a_j c_j^T| - \frac{1}{2} \sum_{n=1}^{N} \sum_{d=1}^{D} \sum_{i=1}^{I_d} \omega_{n,d,i}\, \frac{(a_d x_n - \mu_{d,i})^2}{\sigma_{d,i}^2} + \text{const}$$

$$\;\; = N \log|a_j c_j^T| - \frac{1}{2} \sum_{d=1}^{D} a_d G_d a_d^T + \sum_{d=1}^{D} a_d h_d^T + \text{const}$$

where

$$G_d = \sum_{i=1}^{I_d} \frac{1}{\sigma_{d,i}^2} \left( \sum_{n=1}^{N} \omega_{n,d,i}\, x_n x_n^T \right), \qquad h_d = \sum_{i=1}^{I_d} \frac{\mu_{d,i}}{\sigma_{d,i}^2} \left( \sum_{n=1}^{N} \omega_{n,d,i}\, x_n^T \right).$$

Therefore,

$$\frac{\partial Q}{\partial a_j} = N \frac{c_j}{a_j c_j^T} - a_j G_j + h_j = 0.$$
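As a numerical illustration of the row-wise condition $N c_j/(a_j c_j^T) - a_j G_j + h_j = 0$, the following is a minimal sketch rather than the patented implementation. It assumes a 2x2 matrix A so that cofactors and inverses can be written out by hand, substitutes the scalar $\beta = N/(a_j c_j^T)$ to obtain a quadratic in $\beta$, and keeps whichever root gives the larger row objective. The function names (`cofactor_row_2x2`, `update_row`, and so on) and the sample numbers are hypothetical.

```python
import math

def cofactor_row_2x2(A, j):
    """Cofactors of row j of a 2x2 matrix, so that dot(A[j], c_j) == det(A)."""
    (a, b), (c, d) = A
    return [d, -c] if j == 0 else [-b, a]

def inv_2x2(G):
    """Inverse of a 2x2 matrix (assumed non-singular)."""
    (a, b), (c, d) = G
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[r] * M[r][col] for r in range(len(v))) for col in range(len(M[0]))]

def row_objective(a, c, G, h, N):
    """Part of Q that depends on row a: N log|a c^T| - 0.5 a G a^T + a h^T."""
    return N * math.log(abs(dot(a, c))) - 0.5 * dot(vec_mat(a, G), a) + dot(a, h)

def update_row(c, G, h, N):
    """Solve beta^2 (c G^-1 c^T) + beta (h G^-1 c^T) - N = 0, then a = (beta c + h) G^-1."""
    G_inv = inv_2x2(G)
    quad = dot(vec_mat(c, G_inv), c)   # positive when G is positive definite
    lin = dot(vec_mat(h, G_inv), c)
    root = math.sqrt(lin * lin + 4.0 * quad * N)
    candidates = []
    for beta in ((-lin + root) / (2.0 * quad), (-lin - root) / (2.0 * quad)):
        a = vec_mat([beta * ci + hi for ci, hi in zip(c, h)], G_inv)
        candidates.append((row_objective(a, c, G, h, N), beta, a))
    _, beta, a = max(candidates)       # keep the root with the larger objective
    return beta, a

def stationarity_residual(a, c, G, h, N):
    """Components of N c / (a c^T) - a G + h, which should vanish at the update."""
    aG = vec_mat(a, G)
    ac = dot(a, c)
    return [N * ci / ac - gi + hi for ci, gi, hi in zip(c, aG, h)]
```

Both roots satisfy the stationarity condition; selecting between them by the value of the row objective is a choice made for this sketch.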
Let $\beta = N / (a_j c_j^T)$. So we have $a_j = (\beta c_j + h_j) G_j^{-1}$ and

$$\beta = \frac{N}{(\beta c_j + h_j)\, G_j^{-1} c_j^T},$$

i.e.,

$$\beta^2\, (c_j G_j^{-1} c_j^T) + \beta\, (h_j G_j^{-1} c_j^T) - N = 0. \qquad (6)$$

Therefore, the j-th row $a_j$ can be updated as

$$a_j = (\beta c_j + h_j)\, G_j^{-1}$$

where $\beta$ can be solved from the quadratic equation (6).

It is to be appreciated that the iterative Gaussianization method of the present invention increases the auxiliary function Q at every iteration. It would seem that after updating a particular row $a_j$, we are required to go back and update the Gaussian mixture parameters $(\pi, \mu, \sigma)$ to guarantee improvement on Q. However, that is not necessary because of the following two observations:

(1) The update on $a_j$ depends only on $(\pi_{j,i}, \mu_{j,i}, \sigma_{j,i})$ but not on $(\pi_{d,i}, \mu_{d,i}, \sigma_{d,i} : d \ne j)$.

(2) The update on $(\pi_{j,i}, \mu_{j,i}, \sigma_{j,i})$ depends only on $a_j$ but not on $(a_d : d \ne j)$.

Therefore, it is equivalent that we update A, row by row, with the Gaussian parameters fixed and update all the Gaussian mixture parameters in the next iteration.

An EM algorithm for the same estimation problem, referred to as the noiseless independent factor analysis, was described by H. Attias, in "Independent Factor Analysis", Neural Computation, Vol. 11, pp. 803-51, May 1999. The M-step in the algorithm described by Attias involves gradient descent based on natural gradient. In contrast, our M-step involves closed form solution of the rows of A; that is, there is no gradient descent. Since the iterative Gaussianization method of the invention advantageously increases the auxiliary function Q at every iteration, it converges to a local maximum.

A description of iterative Gaussianization transforms for arbitrary random variables according to various embodiments of the present invention will now be given. Convergence results are also provided.

The iterative Gaussianization method of the invention is based on a series of univariate Gaussianizations. We parameterize univariate Gaussianization via mixtures of univariate Gaussians, as described above. In our analysis we do not consider the approximation error of univariate variables by mixtures of univariate Gaussians. In other words, we assume that the method uses the ideal univariate Gaussianization transform rather than its parametric approximation using Gaussian mixtures.

To define and analyze the iterative Gaussianization method of the invention, we first establish some notations and assumptions.

We define the distance between two random variables X and Y to be the Kullback-Leibler distance between their density functions

$$D(X \| Y) = \int p_X(x) \log \frac{p_X(x)}{p_Y(x)}\, dx$$

where $p_X$ and $p_Y$ are the densities of X and Y respectively. In particular, we define the negentropy of a random variable X as

$$J(X) = D(X \| N(0, I)).$$

It is to be noted that we are taking some slight liberty with the terminology; normally the negentropy of a random variable is defined to be the Kullback-Leibler distance between itself and the Gaussian variable with the same mean and covariance.

In traditional information theory, mutual information is defined to describe the dependency between two random variables. According to the invention, the mutual information of one random variable X is defined to describe the dependence among the dimensions:

$$I(X) = \int p_X(x_1, \ldots, x_n) \log \frac{p_X(x_1, \ldots, x_n)}{p_{x_1}(x_1) \cdots p_{x_n}(x_n)}\, dx_1 \cdots dx_n$$

where $p_{x_d}(x_d)$ is the marginal density of $x_d$. Clearly I(X) is always nonnegative and $I(X) = 0$ if and only if $x_1, \ldots, x_n$ are mutually independent.

A description of six assumptions relating to negentropy and mutual information will now be given, followed by a proof thereof.

First, negentropy can be decomposed as the sum of marginal negentropies and the mutual information. Thus, for the purposes of the invention, assume that for any random variable $X = (x_1, \ldots, x_n)^T$,

$$J(X) = \sum_{i=1}^{n} J(x_i) + I(X).$$

We shall call the sum of negentropies of each dimension the marginal negentropy

$$J_M(X) = \sum_{i=1}^{n} J(x_i).$$

Second, negentropy is invariant under orthogonal linear transforms. Thus, for the purposes of the invention, assume that for any random variable $X \in R^n$ and orthogonal matrix $\Theta_{n \times n}$,

$$J(\Theta X) = J(X).$$

This easily follows from the fact that the Kullback-Leibler distance is invariant under invertible transforms:

$$J(\Theta X) = D(\Theta X \| N(0, I)) = D(X \| \Theta^{-1} N(0, I)) = D(X \| N(0, I)) = J(X).$$

Third, let $\Psi$ be the ideal marginal Gaussianization operator which ideally Gaussianizes each dimension. Thus, for the purposes of the invention, assume that for any random variable X,

$$J_M(\Psi(X)) = 0.$$

Fourth, mutual information is invariant under any invertible dimension-wise transforms. Thus, for the purposes of the invention, let any random variable $X = (x_1, \ldots, x_n)^T$; we transform X by invertible dimension-wise transforms $f_i(x): R^1 \to R^1$;
let $Y = (y_1, \ldots, y_n)^T$ be the transformed variable, with $y_i = f_i(x_i)$. Then

$$I(Y) = I(X).$$

Fifth, the $L^1$ distance between two densities is bounded from above by the square root of the Kullback-Leibler distance. Thus, for the purposes of the invention, let f(x) and g(x) be two n-dimensional density functions, then

$$\int |f(x) - g(x)|\, dx \le [2 D(f \| g)]^{1/2}.$$
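The bound in the fifth assumption can be checked numerically for a simple case. The sketch below is only an illustration under stated assumptions: it takes two univariate Gaussian densities, approximates both integrals with a midpoint rule on a fixed interval, and uses hypothetical helper names (`normal_log_pdf`, `l1_and_kl`).

```python
import math

def normal_log_pdf(x, mean, std):
    """Log of the univariate Gaussian density with the given mean and std."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def l1_and_kl(mean_f, std_f, mean_g, std_g, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule estimates of the L1 distance and of D(f || g)."""
    dx = (hi - lo) / steps
    l1 = 0.0
    kl = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        log_f = normal_log_pdf(x, mean_f, std_f)
        log_g = normal_log_pdf(x, mean_g, std_g)
        f = math.exp(log_f)
        l1 += abs(f - math.exp(log_g)) * dx
        kl += f * (log_f - log_g) * dx
    return l1, kl

if __name__ == "__main__":
    l1, kl = l1_and_kl(0.0, 1.0, 1.0, 1.0)
    print("L1 distance:", l1)
    print("bound sqrt(2 D):", math.sqrt(2.0 * kl))
```

For two unit-variance Gaussians whose means differ by one, the Kullback-Leibler distance is 0.5, so the bound evaluates to 1.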



Sixth, a weak convergence result is proved on random variables, similar to Proposition 14.2 described by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. Thus, for the purposes of the invention, let X and $(X^{(1)}, X^{(2)}, \ldots)$ be random variables in $R^n$; then

$$\lim_{k \to \infty} \sup_{\|\alpha\| = 1} D(\alpha^T X^{(k)} \| \alpha^T X) = 0$$

implies

$$X^{(k)} \to X \quad \text{weakly.}$$

Regarding the proof of the sixth assumption above, let $p^{(k)}$, $p$, $p_\alpha^{(k)}$, $p_\alpha$ be the density functions of $X^{(k)}$, $X$, $\alpha^T X^{(k)}$, $\alpha^T X$, respectively. By the fifth assumption, we have:

$$\sup_{\|\alpha\| = 1} \int |p_\alpha^{(k)} - p_\alpha| \le \sup_{\|\alpha\| = 1} [2 D(p_\alpha^{(k)} \| p_\alpha)]^{1/2} \to 0.$$

Hence, the characteristic functions $\psi_\alpha^{(k)}$ of $p_\alpha^{(k)}$ converge uniformly to the characteristic function $\psi_\alpha$ of $p_\alpha$. Let $\psi^{(k)}$ and $\psi$ be the characteristic functions of the joint densities $p^{(k)}$ and $p$, respectively. The marginal characteristic functions can be related to the joint characteristic functions as

$$\psi_\alpha(\theta) = \psi(\theta \alpha).$$

Therefore,

$$\sup_{\alpha, \theta} |\psi^{(k)}(\theta \alpha) - \psi(\theta \alpha)| = \sup_{\alpha, \theta} |\psi_\alpha^{(k)}(\theta) - \psi_\alpha(\theta)| \to 0.$$

In particular by setting $\theta = 1$,

$$\sup_{\alpha} |\psi^{(k)}(\alpha) - \psi(\alpha)| \to 0,$$

i.e., the characteristic functions of $X^{(k)}$ converge uniformly to the characteristic function of X. Therefore $X^{(k)}$ converges weakly to X, by the continuity theorem of characteristic functions, i.e., the densities $p^{(k)}$ converge to p pointwise at every continuous point of p.

A description of the iterative Gaussianization method of the present invention will now be given. Let $X = (x_1, \ldots, x_n)^T$ be an arbitrary random variable. We intend to find an invertible transform $T: R^n \to R^n$ such that T(X) is distributed as the standard Gaussian, i.e., we would like to find T such that

$$J(T(X)) = 0. \qquad (7)$$

When X is assumed to have independent components after certain linear transforms, we have constructed the Gaussianization transform above via the EM algorithm:

$$T_\theta(X) = \Psi_{\pi,\mu,\sigma}(AX)$$

where the parameters $\theta = (A, \pi, \mu, \sigma)$ and $\Psi_{\pi,\mu,\sigma}$ is the marginal Gaussianization operator parameterized by univariate Gaussian mixture models that Gaussianizes each individual dimension as per equation (2). Clearly, when the number of Gaussian components goes to infinity, we achieve perfect Gaussianization

$$\lim_{I_1, \ldots, I_n \to \infty} J(T_\theta(X)) = 0.$$

For the arbitrary random variable X, $T_\theta$ does not necessarily achieve Gaussianization. However, we can still run the same maximum likelihood EM algorithm. In fact, if we iteratively apply $T_\theta$, we can prove that $T_\theta$ achieves Gaussianization. More specifically, let $X^{(0)} = X$; let

$$X^{(k+1)} = T_{\theta^{(k)}}(X^{(k)}) \qquad (8)$$

where the parameters $\theta^{(k)}$ are estimated from the data associated with the k-th generation $X^{(k)}$. We shall prove that $X^{(k)}$ converges to the standard normal distribution.

A description of the relationship of iterative Gaussianization to maximum likelihood will now be given. We first show that the maximum likelihood EM method of the invention described above actually minimizes the negentropy distance

$$\min_\theta J(T_\theta(X)). \qquad (9)$$

Let $X, Y \in R^n$ be two random variables; let $T_\theta: R^n \to R^n$ be a parameterized invertible transform with parameters $\theta$. Then, finding the best transform

$$\min_\theta D(T_\theta(X) \| Y)$$

is equivalent to maximizing a likelihood function which is parameterized by $\theta$:

$$\max_\theta E_X(\log p_\theta(X)),$$

where the density model $p_\theta(x)$ is the density of $T_\theta^{-1}(Y)$. This easily follows from the invariance of the Kullback-Leibler distance under invertible transforms, $D(T_\theta(X) \| Y) = D(X \| T_\theta^{-1}(Y))$, whose only term depending on $\theta$ is $-E_X(\log p_\theta(X))$.

A convergence proof will now be given regarding the iterative Gaussianization method of the invention. We now analyze the Gaussianization process

$$X^{(k+1)} = \Psi_{\pi,\mu,\sigma}(A X^{(k)})$$

where the linear transform A and the marginal Gaussianization parameters $\{\pi, \mu, \sigma\}$ are obtained by minimizing the negentropy of $X^{(k+1)}$. For the sake of simplicity, we assume that we can achieve perfect univariate Gaussianization for any univariate random variable. In fact, when the number of Gaussians goes to infinity, it can be shown that the univariate Gaussianization parameterized by mixtures of univariate Gaussians above indeed converges. Therefore, in our convergence analysis, we do not consider the approximation error of univariate variables by mixtures of univariate Gaussians. We assume the perfect marginal Gaussianization operator $\Psi$ is available which does not need to be estimated. We shall analyze the ideal iterative Gaussianization procedure

$$X^{(k+1)} = \Psi(A X^{(k)})$$

where the linear transform A is obtained by minimizing the negentropy of $X^{(k+1)}$. Obviously, if the ideal iterative procedure converges, the actual iterative procedure also converges, when the number of univariate Gaussians goes to infinity.

From the first and third assumptions above, it is clear that the negentropy of $X^{(k+1)}$ is

$$J(X^{(k+1)}) = I(\Psi(A X^{(k)})) = I(A X^{(k)}),$$

where the last equality uses the fourth assumption. Therefore, the iterative Gaussianization process attempts to linearly transform the current variable $X^{(k)}$ to the most independent view:

$$\min_A J(\Psi(A X^{(k)})) \Leftrightarrow \min_A I(A X^{(k)}).$$

We now prove the convergence of the iterative Gaussianization method of the invention. Our proof resembles the proof of convergence for projection pursuit density estimates described by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. Let $\Delta^{(k)}$ be the reduction in the negentropy in the k-th iteration:

$$\Delta^{(k)} = J(X^{(k)}) - J(X^{(k+1)}) = J(X^{(k)}) - \inf_A I(A X^{(k)}).$$

Since $\{J(X^{(k)})\}$ is a monotonically decreasing sequence and bounded from below by 0, we have

$$\lim_{k \to \infty} \Delta^{(k)} = 0.$$

In fact, for any given $\epsilon > 0$, it takes at most $k = J(X^{(0)})/\epsilon$ iterations to reach a variable $X^{(k)}$ such that $\Delta^{(k)} \le \epsilon$.

Following the argument presented in the immediately preceding article by Huber, the maximum marginal negentropy is defined as

$$J^*(X) = \sup_{\|\alpha\| = 1} J(\alpha^T X).$$

Clearly, $J(X^{(k)}) \to 0$ implies $J^*(X^{(k)}) \to 0$. However, the reverse is not necessarily true. We shall show that the maximum marginal negentropy of $X^{(k)}$ is bounded from above by $\Delta^{(k)}$. For any unit vector $\alpha$ ($\|\alpha\| = 1$), let $U_\alpha$ be an orthogonal completion of $\alpha$:

$$U_\alpha = [\alpha, \alpha_2, \ldots, \alpha_n].$$

Applying the first and second assumptions above, we have

$$J(\alpha^T X^{(k)}) \le J_M(U_\alpha^T X^{(k)}) = J(U_\alpha^T X^{(k)}) - I(U_\alpha^T X^{(k)}) = J(X^{(k)}) - I(U_\alpha^T X^{(k)}).$$

Therefore,

$$\sup_{\|\alpha\| = 1} J(\alpha^T X^{(k)}) \le J(X^{(k)}) - \inf_{U_\alpha} I(U_\alpha^T X^{(k)}) \le J(X^{(k)}) - \inf_{\text{arbitrary } A} I(A X^{(k)}),$$

i.e.,

$$J^*(X^{(k)}) \le \Delta^{(k)}.$$

Since $\Delta^{(k)} \to 0$, we have

$$J^*(X^{(k)}) \to 0.$$

Applying the sixth assumption, we can now establish our convergence result. For iterative Gaussianization as in equation (8),

$$X^{(k)} \to N(0, I)$$

in the sense of weak convergence, i.e., the density function of $X^{(k)}$ converges pointwise to the density function of standard normal.

A description will now be given of how the iterative Gaussianization method of the invention can be viewed as a parametric generalization of high dimensional projection pursuit.

A nonparametric density estimation scheme called projection pursuit density estimation is described by Friedman et al., in "Projection Pursuit Density Estimation", J. American Statistical Association, Vol. 79, pp. 599-608, September 1984. Its convergence result was subsequently proved by P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. In this scheme, the density function p(x) of a random variable $X \in R^n$ is approximated by a product of ridge functions

$$p_k(x) = p_0(x) \prod_{j=1}^{k} h_j(\alpha_j^T x)$$

where $p_0(x)$ is some standard probability density in $R^n$ (e.g., a normal density with the same mean and covariance as p(x)). At each iteration (k+1), the update $h_{k+1}$ can be obtained by

$$h_{k+1}(\alpha^T x) = \frac{p(\alpha^T x)}{p_k(\alpha^T x)}$$

where the direction $\alpha$ is constrained to have unit length and can be computed by

$$\alpha = \arg\max_\alpha D(p(\alpha^T x) \| p_k(\alpha^T x)). \qquad (10)$$

It has been suggested to estimate the marginal densities $p(\alpha^T x)$ of the data by histograms or kernel estimates with the observed samples and to estimate the marginal density $p_k(\alpha^T x)$ of the current model by histograms or kernel estimates with Monte Carlo samples, then to optimize (as in equation 10) by gradient descent. These suggestions are made by Friedman et al., in "Projection Pursuit Density Estimation", J. American Statistical Association, Vol. 79, pp. 599-608, September 1984; and P. J. Huber, in "Projection Pursuit", Annals of Statistics, Vol. 13, pp. 435-525, April 1985. In our view, this scheme attempts to approximate the multivariate density by a series of univariate densities; the scheme finds structures in the domain of density function. However, since no transforms are involved, the scheme does not explicitly find structures in the data domain. In addition, the kernel estimation, the Monte Carlo sampling, and the gradient descent can be quite cumbersome.
Data is explicitly transformed by the exploratory projection pursuit described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, pp. 249-66, March 1987. Let $X^{(0)}$ be the original random variable. At each iteration, the most non-Gaussian projection of the current variable $X^{(k)}$ is found:

$$\alpha_k = \arg\max_\alpha D(\alpha^T X^{(k)} \| G_\alpha)$$

where $G_\alpha$ is the univariate Gaussian random variable with the same mean and covariance as $\alpha^T X^{(k)}$. Let U be an orthogonal matrix with $\alpha_k$ being its first column

$$U = [\alpha_k, \ldots].$$

One then transforms $X^{(k)}$ into the U domain:

$$Y = U^T X^{(k)}.$$

The first coordinate of Y is Gaussianized, and the remaining coordinates are left unchanged:

$$\tilde{y}_1 = \Psi(y_1), \qquad \tilde{y}_d = y_d, \quad d = 2, \ldots, n.$$

One obtains $X^{(k+1)}$ as

$$X^{(k+1)} = U \tilde{Y}.$$

If we denote $\Psi_1$ as the operator which Gaussianizes the first coordinate and leaves the remaining coordinates unchanged, then we have

$$X^{(k+1)} = U \Psi_1(U^T X^{(k)}).$$

The two dimensional exploratory projection pursuit was also described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, pp. 249-66, March 1987. According to the two dimensional exploratory projection pursuit, at each iteration, one locates the most jointly non-Gaussian two dimensional projection of the data and then jointly Gaussianizes that two dimensional plane. To jointly Gaussianize a two dimensional variable, rotated (about the origin) projections of the two dimensional plane are repeatedly Gaussianized until it becomes approximately normal. More specifically, let $Y = (y_1, y_2)^T$ be the two dimensional plane which is most non-Gaussian. Let

$$y'_1 = y_1 \cos\gamma + y_2 \sin\gamma$$

$$y'_2 = y_2 \cos\gamma - y_1 \sin\gamma$$

be a rotation about the origin through angle $\gamma$. $y'_1$ and $y'_2$ can then be Gaussianized individually. This process is repeated (on the previously rotated variables) for several values of $\gamma = (0, \pi/8, \pi/4, 3\pi/8, \ldots)$. This entire process is then repeated until the distributions stop becoming more normal.

The iterative Gaussianization method of the invention can be easily modified to achieve high dimensional exploratory projection pursuit with efficient parametric optimization. Assume that we are interested in l-dimensional projections where $1 \le l \le n$. The modification will now be described. At each iteration, instead of marginally Gaussianizing all the coordinates of $A X^{(k)}$, we marginally Gaussianize the first l coordinates only:

$$X^{(k+1)} = \Psi_l(A X^{(k)}) \qquad (11)$$

where the operator $\Psi_l$ marginally Gaussianizes the first l coordinates and leaves the remaining coordinates fixed; A is obtained by minimizing the negentropy of $X^{(k+1)}$:

$$\min_A J(\Psi_l(A X^{(k)})). \qquad (12)$$

In practice, we estimate the ideal operator $\Psi_l$ by mixtures of univariate Gaussians as described above. For $X \in R^n$, we define

$$\hat{\Psi}_l(X) = (\psi_1(x_1), \ldots, \psi_l(x_l), x_{l+1}, \ldots, x_n)^T$$

where

$$\psi_d(x_d) = \Phi^{-1}\left( \sum_{i=1}^{I_d} \pi_{d,i}\, \Phi\left( \frac{x_d - \mu_{d,i}}{\sigma_{d,i}} \right) \right)$$

and $\Phi$ is the standard normal cumulative distribution function. This is equivalent to assuming that X has the following distribution:

$$p_X(x_1, \ldots, x_n) = \left[ \prod_{d=1}^{l} \sum_{i=1}^{I_d} \pi_{d,i}\, G(x_d, \mu_{d,i}, \sigma_{d,i}^2) \right] \left[ \prod_{d=l+1}^{n} G(x_d, 0, 1) \right].$$

Therefore, the EM method of the invention described above can be easily accommodated to model $\Psi_l$ by setting the number of univariate Gaussians in dimensions l+1 through n to be one:

$$I_d = 1, \quad l + 1 \le d \le n.$$
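The action of an operator that Gaussianizes only the first l coordinates can be sketched as follows. This is an illustration under a loudly stated substitution: instead of the mixture-of-Gaussians parameterization described in the text, it uses an empirical rank-based Gaussianization of each selected coordinate, which needs no fitted parameters. The function names `gaussianize_column` and `gaussianize_first_l` are hypothetical.

```python
from statistics import NormalDist

def gaussianize_column(values):
    """Map one coordinate to standard normal scores through its empirical ranks.

    Assumption: this rank transform stands in for the parametric univariate
    Gaussianization; it is not the mixture-based operator of the text.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order):
        out[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return out

def gaussianize_first_l(samples, l):
    """Gaussianize the first l coordinates of each sample; leave the rest fixed."""
    columns = [list(col) for col in zip(*samples)]
    for d in range(l):
        columns[d] = gaussianize_column(columns[d])
    return [list(row) for row in zip(*columns)]

if __name__ == "__main__":
    data = [[2.0 ** i, float(i % 3)] for i in range(11)]
    for row in gaussianize_first_l(data, 1):
        print(row)
```

In the full procedure the samples passed in would be the linearly transformed data, and the linear transform would be re-estimated at each iteration; neither step is shown here.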

A description of a constrained version of the iterative Gaussianization method of the invention will now be given. First whitening the data and then considering only orthogonal projections for two dimensional projection pursuit has been described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, pp. 249-66, March 1987. Here, the same assumption is made; we constrain the linear transform A to be a whitening transform W followed by an orthogonal transform U

$$A = UW$$

where the whitening matrix W whitens the data $X^{(k)}$

$$\mathrm{Var}(W X^{(k)}) = I.$$

For example, W can be taken as 
We partition U as 



60 



65 



where $U_1$ are the first l rows. Let $Y^{(k)}$ be the whitened variable $Y^{(k)} = W X^{(k)}$. Applying the first, third, and second assumptions above, we have



and 
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Taking the difference, we have 

Therefore, the modified iterative Gaussianization method (11) with the constraint

$$A = UW$$

is equivalent to finding the l-dimensional marginally most non-Gaussian orthogonal projection of the whitened data. Let $\hat{U}_1$ be the l-dimensional marginally most non-Gaussian orthogonal projection of the whitened data:

$$\hat{U}_1 = \arg\max_{U_1} J_M(U_1 W X^{(k)}).$$

Let $\hat{U}$ be any orthogonal completion of $\hat{U}_1$:

$$\hat{U} = \begin{pmatrix} \hat{U}_1 \\ \hat{U}_2 \end{pmatrix}.$$

Then $\hat{U}$ is a solution of the constrained modified iterative Gaussianization procedure

$$\min_{A = UW} J(\Psi_l(A X^{(k)})).$$
The EM method of the invention must be modified to accommodate the orthogonality constraint on the linear transform. Since the original EM method of the invention estimates each row of the matrix, it cannot be directly applied here. We propose using either the well-known Householder parameterization or Givens parameterization for the orthogonal matrix and modifying the M-step accordingly.

The weak convergence result of this constrained modified iterative Gaussianization can be easily established by deriving the corresponding key inequality and following the same proof described above with respect to convergence. Obviously, for larger l the convergence is faster, since the improvement at iteration k is larger.

Friedman's high dimensional exploratory projection algorithm attempts to find the most jointly non-Gaussian l dimensional projection. The computation of the projection index involves high dimensional density estimation. In contrast, the Gaussianization method of the invention tries to find the most marginally non-Gaussian l dimensional projection; it involves only univariate density estimation.

The bottleneck of l-dimensional exploratory projection pursuit is to Gaussianize the most jointly non-Gaussian l-dimensional projection into standard Gaussian, as described by J. H. Friedman, in "Exploratory Projection Pursuit", J. American Statistical Association, Vol. 82, pp. 249-66, March 1987. In contrast, the Gaussianization method of the invention involves only univariate Gaussianization and can be computed by an efficient EM algorithm.

A description of an unconstrained version of the iterative Gaussianization method of the invention will now be given. If we do not employ the orthogonality constraint and we solve the more general problem (as per equation 12), we would get faster convergence, and the original EM method of the invention can be used without any modification. At each iteration, since

$$J(\Psi_l(A X^{(k)})) = I(A X^{(k)}) + J_M(\bar{A} X^{(k)})$$

where $\bar{A}$ consists of the last (n-l) rows of A, we can also argue that this unconstrained procedure attempts to find the most independent view in which the l most marginally non-Gaussian directions can be chosen.

A description of hierarchical independence pursuit according to an embodiment of the invention will now be given. In the construction of probabilistic models for high dimensional random variables, it is crucial to find the independence or conditional independence structures present in the data. For example, if we know that the random variable $X \in R^n$ can be separated into two variables $X = (X_1, X_2)$ such that $X_1 \in R^{n_1}$ is independent of $X_2 \in R^{n_2}$, we can then construct probabilistic models separately for $X_1$ and $X_2$, which can be much simpler than the original problem.

We attempt to transform the original variable to reveal independence structures. The iterative Gaussianization method of the invention can be viewed as independence pursuit, since when it converges, we have reached a standard normal variable, which is obviously independent dimension-wise. Moreover, the iterative Gaussianization method of the invention can be viewed as finding independent structures in a hierarchical fashion.

We first analyze the ability of finding independent structures at each Gaussianization iteration. We have shown above that each step of Gaussianization minimizes the mutual information

$$\min_A I(A X^{(k)}). \qquad (13)$$

A 

If there exists a linear transform such that the transformed variable can be separated into independent subvariables, then our minimization of the mutual information will find that independent structure. This is precisely the problem of Multidimensional Independent Component Analysis proposed by J.-F. Cardoso, in "Multidimensional Independent Component Analysis", Proceedings of ICASSP '98, Seattle, Wash., May 1998. It is a generalization of the independent component analysis, where the components are independent of one another but not necessarily one dimensional. We call a random variable $X \in R^n$ minimal if any linear transform on X does not reduce the mutual information and all the dimensions of the transformed variable Y are dependent. From the description provided by Cardoso in the immediately preceding article, we have the following result that minimizing the mutual information (13) finds the underlying minimal structure.

Let $Y = (Y_1, \ldots, Y_M)$ where each $Y_j \in R^{n_j}$ is minimal. Let the observed variable X be a certain linear transform of Y: $X = BY$. Then, the solution of minimizing the mutual information is

$$B^{-1} = \arg\min_A I(AX).$$



The iterative Gaussianization method of the invention induces independent structures which can be arranged hierarchically as a tree. Assume that at the k-th iteration, $X^{(k)}$ has multidimensional independent components $X^{(k)} = (X_1^{(k)}, \ldots, X_M^{(k)})^T$, where each multidimensional component $X_j^{(k)}$ is in $R^{n_j}$ and $\sum_{j=1}^{M} n_j = n$. It can be easily shown that the next iteration of Gaussianization is equivalent to running one iteration of Gaussianization on each individual multidimensional component $X_j^{(k)}$.






If X has multidimensional independent components $X = (X_1, \ldots, X_M)$, then

$$\arg\min_A J(\Psi(AX)) \Leftrightarrow \arg\min_{A_j} J(\Psi(A_j X_j)) \quad \forall\, 1 \le j \le M.$$



Therefore, the iterative Gaussianization method of the invention can be viewed as follows. We run iterations of Gaussianization on the observed variable until we obtain multidimensional independent components; we then run iterations of Gaussianization on each multidimensional independent component; and we repeat the process. This induces a tree which describes the independence structures among the observed random variable $X \in R^n$. The tree has n leaves. Each leaf l is associated with its parents and grandparents and so on:

$$X_l \to X_{l\uparrow} \to X_{l\uparrow\uparrow} \to \cdots \to X_{root}$$

where $l\uparrow$ denotes the parent node of l and $X_{root}$ is the observed variable X. In this tree, all the children of a
particular node are obtained by running iterations of Gaus- 
sianization until we obtain multidimensional independent 
components; the multidimensional independent components 
are the children. 

In practice, we need to detect the existence of multidimensional independent components. Accordingly, we provide two relevant methods. The first method is based on the fact that if $X = (x_1, \ldots, x_n)$ has multidimensional independent components $X = (X_1, \ldots, X_M)$, then the matrix A, which we obtain by running one iteration of Gaussianization on X
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above can be viewed as parametric density estimation. We 
now describe two forms of density estimates based on 
Gaussianization according to the invention: 

(1) Iterative Models which can be directly obtained from 
the iterative Gaussianization method of the invention 
described above. 

(2) Mixture Models in which each cluster corresponds to 
one iteration of Gaussianization. 

A description of iterative Gaussianization densities according to the invention will now be given. Consider the sequence of optimal parametric Gaussianization transformations $T_{\theta_k}$ (as described above) acting on a random variable $X = X^{(0)}$. Since the sequence $X^{(k)} = T_{\theta_k} X^{(k-1)}$ converges to a standard normal distribution, for sufficiently large k, $X^{(k)}$ is approximately Gaussian. Since the transformations $T_{\theta_k}$ are all invertible, a standard normal density estimate for $X^{(k)}$ implies a density estimate for X. Indeed, if we define for each k



(14) 



where $N \sim N(0, I)$, then the density of $Y^{(k)}$ can be considered a parametric estimate of the density of X for each k. The larger k, the better the estimate. If $H_{\theta_k}(y^{(k)})$ denotes the Jacobian determinant of $T_{\theta_k}$ evaluated at $y^{(k)}$, i.e.,



\ dT h Z \ 
dZ I 
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then, the probability distribution of is given by 



$$\min_A I(AX),$$

will be block diagonal. More specifically, we have $A_{ij} = 0$ if $x_i$ is independent of $x_j$. Therefore, we define the distance between dimension i and j to be the absolute value of $A_{ij}$ and run the bottom-up hierarchical clustering scheme with maximum linkage. We can obtain an estimate of the multidimensional independent components by applying thresholding on the clustering tree with a threshold level $\epsilon > 0$.

The second method is also based on bottom-up hierarchical clustering with maximum linkage; however, the distance between dimension i and j is defined to reflect the dependency between $x_i$ and $x_j$, such as the Hoeffding statistics, the Blum-Kiefer-Rosenblatt statistics, or the Kendall's rank correlation coefficient, as described by H. Wolfe, Nonparametric Statistical Methods, Wiley, 1973. These statistics are designed to perform a nonparametric test of the independence of two random variables. The threshold level $\epsilon$ can be chosen according to the significance level of the corresponding independence test statistics.
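Of the dependence statistics named above, Kendall's rank correlation is the simplest to write down. The sketch below computes the tau-a variant (no tie correction) for every pair of dimensions; the function names are hypothetical, and how the resulting values are fed into the clustering and thresholding steps is deliberately left open.

```python
def kendall_tau(x, y):
    """Kendall's rank correlation (tau-a) between two equal-length samples."""
    n = len(x)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def pairwise_dependence(samples):
    """Matrix of |tau| between every pair of dimensions of the samples."""
    columns = [list(col) for col in zip(*samples)]
    dims = len(columns)
    table = [[0.0] * dims for _ in range(dims)]
    for i in range(dims):
        for j in range(dims):
            table[i][j] = abs(kendall_tau(columns[i], columns[j]))
    return table

if __name__ == "__main__":
    data = [[1.0, 2.0, 5.0], [2.0, 4.0, 1.0], [3.0, 6.0, 4.0], [4.0, 8.0, 2.0]]
    for row in pairwise_dependence(data):
        print(row)
```

The double loop is quadratic in the number of samples, which is acceptable for a sketch but would usually be replaced by a library routine for large data sets.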

The above scheme is also applicable for the purpose of speeding up the iterative Gaussianization method of the invention. Once we achieve multidimensional independent components, we can run iterations of Gaussianization on the individual multidimensional independent components instead of the entire dimensions, since Gaussianizing a lower dimensional variable is computationally advantageous.

A description of density estimation via Gaussianization according to the invention will now be given. As stated above, Gaussianization can be viewed as density estimation. In particular, our parameterized Gaussianization discussed



(15) 



~jL^H ei ^)^H h {y^)H h <y»)c*p(-i||7> 4 ... T &1 />|| 2 ) 



Essentially, for each k, $Y^{(k)}$ comes from a parametric family of densities, for example $\mathcal{H}_k$, defined by Equation 15.

The iterative Gaussianization method of the invention can be viewed as an approximation process where the probability distribution of X, e.g., p(x), is approximated in a product form (Equation 15). In particular, if p(x) belongs to one of the families, e.g., $\mathcal{H}_k$, then it has an exact representation.
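The way a chain of invertible transforms yields a density estimate can be illustrated in one dimension: the log density at x is the standard normal log density at the fully transformed point plus the sum of the log absolute derivatives picked up along the way. The sketch below assumes two hypothetical transforms (`affine` and `asinh_warp`) chosen only because their derivatives are easy to write; they are not the mixture-based transforms of the text.

```python
import math

def affine(mean, scale):
    """x -> (x - mean) / scale, returned with its log|derivative| function."""
    return (lambda x: (x - mean) / scale, lambda x: -math.log(scale))

def asinh_warp():
    """x -> asinh(x), whose derivative is 1 / sqrt(1 + x^2)."""
    return (math.asinh, lambda x: -0.5 * math.log1p(x * x))

def log_density(x, transforms):
    """Log density of X when transforms applied in order map X to N(0, 1)."""
    log_jacobian = 0.0
    for forward, log_abs_derivative in transforms:
        log_jacobian += log_abs_derivative(x)   # derivative at the current point
        x = forward(x)
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi) + log_jacobian

if __name__ == "__main__":
    chain = [asinh_warp(), affine(0.3, 0.5)]
    for point in (-2.0, 0.0, 0.3, 2.0):
        print(point, math.exp(log_density(point, chain)))
```

With a single affine transform the result reduces to the ordinary Gaussian density, which gives a quick consistency check.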

A description of mixtures of Gaussianizations will now be given. The family $\mathcal{H}_1$ will be called linearly transformed Compound Gaussians, and is closely related to Independent Component Analysis. $\mathcal{H}_1(\theta)$ with the restriction that the linear transform $A = I$ will be called Compound Gaussians. Clearly $X = \{x_1, \ldots, x_D\}$ is a compound Gaussian if the dimensions $x_d$'s are independent and each dimension $x_d$ is a Gaussian mixture

$$p(x_d) = \sum_{i=1}^{I_d} \pi_{d,i}\, G(x_d, \mu_{d,i}, \sigma_{d,i}^2).$$



The density of a compound Gaussian can be written as:

$$f(x_1, \ldots, x_D) = \prod_{d=1}^{D} \sum_{i=1}^{I_d} \pi_{d,i}\, G(x_d, \mu_{d,i}, \sigma_{d,i}^2). \qquad (16)$$
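A compound Gaussian density is a product over dimensions of univariate Gaussian mixtures, and multiplying the product out gives an ordinary diagonal-covariance mixture with one component per combination of per-dimension indices. The following minimal sketch evaluates both forms so they can be compared; the function names and the nested-list parameter layout (`priors[d][i]`, `means[d][i]`, `variances[d][i]`) are assumptions made for the illustration.

```python
import math
from itertools import product

def normal_pdf(x, mean, variance):
    """Univariate Gaussian density G(x, mean, variance)."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)

def compound_gaussian_pdf(x, priors, means, variances):
    """Product over dimensions d of sum_i priors[d][i] * G(x[d], means[d][i], variances[d][i])."""
    density = 1.0
    for d, x_d in enumerate(x):
        density *= sum(
            p * normal_pdf(x_d, m, v)
            for p, m, v in zip(priors[d], means[d], variances[d])
        )
    return density

def expanded_mixture_pdf(x, priors, means, variances):
    """The same density expanded as a sum over all combinations of component indices."""
    total = 0.0
    for choice in product(*[range(len(p)) for p in priors]):
        term = 1.0
        for d, i in enumerate(choice):
            term *= priors[d][i] * normal_pdf(x[d], means[d][i], variances[d][i])
        total += term
    return total

def expanded_component_count(priors):
    """Number of diagonal Gaussians in the expanded form: the product of the I_d."""
    count = 1
    for p in priors:
        count *= len(p)
    return count
```

The expanded form grows multiplicatively with the per-dimension component counts, while the product form stores only their sum, which is the economy the compound Gaussian is designed to exploit.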



Compound Gaussians are closely related to Gaussian 
mixtures. The definition (equation 16) can be expanded as a 
sum of product terms: 
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5 

Compound Gaussians diagonal covariance Gaussian mix- are independent 

tures with I mixture components: We convert u„ into binary variables {u n l , . . . , u M ^}: 



10 



Similarly we define 
Note that a mixture of 

15 Therefore, 



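diagonal Gaussians has [expression illegible in source] free
parameters. However, in the equivalent definition of the
compound Gaussians (as per equation 17), the priors, means
and variances are constrained in a particular fashion such
that the number of effective free parameters is merely
[expression illegible in source].

The compound Gaussian distribution is specifically
designed for multivariate variables which are independent
dimension-wise while non-Gaussian in each dimension. In
such cases, modeling as mixtures of diagonal Gaussians
would be extremely inefficient.

However, compound Gaussians are not able to describe
the dependencies among the dimensions, whereas mixtures
of diagonal Gaussians can. To accommodate the
dependencies, one may consider mixtures of compound
Gaussians, which are a generalization of mixtures of diagonal
Gaussians.

Gaussianization leads to a collection of families of
probability densities. These densities can be further
generalized by considering mixtures over these families. We
have already seen that compound Gaussians are not able to
describe dependencies between dimensions, while mixtures
of compound Gaussians can, illustrating that this
generalization can sometimes be useful.

A description of an EM method for training the mixture of
compound Gaussians according to an embodiment of the
invention will now be given. Let $\{x_n \in R^D : 1 \le n \le N\}$ be the
training set; assume they are independently and identically
distributed as a mixture of compound Gaussians:

$$p(x) = \sum_{k=1}^{K} \rho_k \prod_{d=1}^{D} \sum_{i=1}^{I_{k,d}} \pi_{k,d,i}\, G(x_d, \mu_{k,d,i}, \sigma^2_{k,d,i})$$

Let $(x_n \in R^D, u_n \in N, z_n \in N^D : n = 1, \ldots, N)$ be the complete
data, where $u_n$ indicates the index of the compound
Gaussian, and $z_n = (z_{n,1}, \ldots, z_{n,D})$ are the indexes of the
particular Gaussian along all the dimensions. It is clear that

$$p(x_n, u_n, z_n) = p(u_n)\, p(z_n \mid u_n)\, p(x_n \mid u_n, z_n) = \rho_{u_n} \prod_{d=1}^{D} \pi_{u_n,d,z_{n,d}}\, G(x_{n,d}, \mu_{u_n,d,z_{n,d}}, \sigma^2_{u_n,d,z_{n,d}})$$

We obtain the following complete log likelihood

$$L(x, u, z, \theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} u_{n,k} \Big[ \log \rho_k + \sum_{d=1}^{D} \sum_{i=1}^{I_{k,d}} z_{n,k,d,i} \Big( \log \pi_{k,d,i} - \tfrac{1}{2} \log 2\pi\sigma^2_{k,d,i} - \frac{(x_{n,d} - \mu_{k,d,i})^2}{2\sigma^2_{k,d,i}} \Big) \Big]$$

where $u_{n,k}$ equals 1 if $u_n = k$ and 0 otherwise, $z_{n,k,d,i}$ equals 1
if $z_{n,d} = i$ and 0 otherwise, and $\theta = (\rho, \pi, \mu, \sigma)$.

In the E-step, we compute the auxiliary function,

$$Q(\theta, \hat{\theta}) = E\big(L(x, u, z, \theta) \mid x, \hat{\theta}\big) = \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,k} \log \rho_k + \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{d=1}^{D} \sum_{i=1}^{I_{k,d}} w_{n,k,d,i} \Big[ \log \pi_{k,d,i} - \tfrac{1}{2} \log 2\pi\sigma^2_{k,d,i} - \frac{(x_{n,d} - \mu_{k,d,i})^2}{2\sigma^2_{k,d,i}} \Big]$$

where, with the parameters on the right-hand side taken from
the current estimate $\hat{\theta}$,

$$w_{n,k} = E(u_{n,k} \mid x_n, \hat{\theta}) = \frac{\rho_k \prod_{d=1}^{D} \sum_{i=1}^{I_{k,d}} \pi_{k,d,i}\, G(x_{n,d}, \mu_{k,d,i}, \sigma^2_{k,d,i})}{\sum_{k'=1}^{K} \rho_{k'} \prod_{d=1}^{D} \sum_{i=1}^{I_{k',d}} \pi_{k',d,i}\, G(x_{n,d}, \mu_{k',d,i}, \sigma^2_{k',d,i})}$$

$$w_{n,k,d,i} = E(u_{n,k} z_{n,k,d,i} \mid x_n, \hat{\theta}) = w_{n,k}\, \frac{\pi_{k,d,i}\, G(x_{n,d}, \mu_{k,d,i}, \sigma^2_{k,d,i})}{\sum_{i'=1}^{I_{k,d}} \pi_{k,d,i'}\, G(x_{n,d}, \mu_{k,d,i'}, \sigma^2_{k,d,i'})}$$

In the M-step, we maximize the auxiliary function Q to
update the parameter estimation

$$\hat{\theta} = \arg\max_{\theta} Q(\theta, \hat{\theta})$$

which gives

$$\rho_k = \frac{\sum_{n=1}^{N} w_{n,k}}{\sum_{k'=1}^{K} \sum_{n=1}^{N} w_{n,k'}}$$

$$\pi_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}}{\sum_{i'=1}^{I_{k,d}} \sum_{n=1}^{N} w_{n,k,d,i'}}$$

$$\mu_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}\, x_{n,d}}{\sum_{n=1}^{N} w_{n,k,d,i}}$$

$$\sigma^2_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}\, (x_{n,d} - \mu_{k,d,i})^2}{\sum_{n=1}^{N} w_{n,k,d,i}}$$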
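One EM iteration for a mixture of compound Gaussians can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the nested-list parameter layout (`rho[k]`, `pi[k][d][i]`, `mu[k][d][i]`, `var[k][d][i]`), the function names, and the `var_floor` safeguard are assumptions introduced here, and no log-domain arithmetic is used, so it is only suitable for small, well-scaled data.

```python
import math


def gauss(x, mu, var):
    """Univariate Gaussian density G(x, mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def component_terms(x, k, pi, mu, var):
    """terms[d][i] = pi[k][d][i] * G(x[d], mu[k][d][i], var[k][d][i])."""
    return [[pi[k][d][i] * gauss(x[d], mu[k][d][i], var[k][d][i])
             for i in range(len(pi[k][d]))] for d in range(len(x))]


def log_likelihood(X, rho, pi, mu, var):
    """Sum over n of log p(x_n) under the mixture of compound Gaussians."""
    total = 0.0
    for x in X:
        p = 0.0
        for k in range(len(rho)):
            terms = component_terms(x, k, pi, mu, var)
            p += rho[k] * math.prod(sum(t) for t in terms)
        total += math.log(p)
    return total


def em_step(X, rho, pi, mu, var, var_floor=1e-6):
    """One E-step plus M-step; returns updated (rho, pi, mu, var)."""
    K, D, N = len(rho), len(X[0]), len(X)

    def zeros():
        return [[[0.0] * len(pi[k][d]) for d in range(D)] for k in range(K)]

    s0, s1, s2 = zeros(), zeros(), zeros()  # sums of w, w*x, w*x^2
    sk = [0.0] * K                          # sums of w_{n,k}
    for x in X:
        # E-step: posterior weights w_{n,k} and w_{n,k,d,i}
        terms = [component_terms(x, k, pi, mu, var) for k in range(K)]
        joint = [rho[k] * math.prod(sum(t) for t in terms[k]) for k in range(K)]
        norm = sum(joint)
        for k in range(K):
            w_nk = joint[k] / norm
            sk[k] += w_nk
            for d in range(D):
                tot = sum(terms[k][d])
                for i, t in enumerate(terms[k][d]):
                    w = w_nk * t / tot
                    s0[k][d][i] += w
                    s1[k][d][i] += w * x[d]
                    s2[k][d][i] += w * x[d] * x[d]
    # M-step: closed-form maximizers of the auxiliary function Q
    new_rho = [s / N for s in sk]
    new_pi, new_mu, new_var = zeros(), zeros(), zeros()
    for k in range(K):
        for d in range(D):
            for i in range(len(pi[k][d])):
                new_pi[k][d][i] = s0[k][d][i] / sk[k]
                m = s1[k][d][i] / s0[k][d][i]
                new_mu[k][d][i] = m
                # var_floor is an assumed safeguard against collapse
                new_var[k][d][i] = max(s2[k][d][i] / s0[k][d][i] - m * m, var_floor)
    return new_rho, new_pi, new_mu, new_var
```

Calling `em_step` repeatedly and monitoring `log_likelihood` gives the usual EM loop; as long as the variance floor is not active, the log likelihood should not decrease from one iteration to the next.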



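A description will now be given of mixtures of compound
Gaussians with linear transforms. Let $x \in R^D$ be the random
variable we are interested in. Assume that after a linear
transform A, the transformed variable y = Ax can be modeled
as a mixture of compound Gaussians:

$$f(y) = \sum_{k=1}^{K} \rho_k \prod_{d=1}^{D} \sum_{i=1}^{I_{k,d}} \pi_{k,d,i}\, G(y_d, \mu_{k,d,i}, \sigma^2_{k,d,i})$$

We describe here an EM method according to the invention
to estimate both the transform matrix A and the compound
Gaussian model parameters. Let $\{x_n \in R^D : 1 \le n \le N\}$ be the
training set. Let $\{y_n = A x_n : 1 \le n \le N\}$ be the transformed data.
Let $(y_n \in R^D, x_n \in R^D, u_n \in N, z_n \in N^D : n = 1, \ldots, N)$ be the
complete data, where $u_n$ indicates the index of the compound
Gaussian, and $z_n = (z_{n,1}, \ldots, z_{n,D})$ are the indexes of the
particular Gaussian along all the dimensions. We obtain the
following complete log likelihood

$$L(x, u, z, \theta) = N \log |\det A| + \sum_{n=1}^{N} \sum_{k=1}^{K} u_{n,k} \Big[ \log \rho_k + \sum_{d=1}^{D} \sum_{i=1}^{I_{k,d}} z_{n,k,d,i} \Big( \log \pi_{k,d,i} - \tfrac{1}{2} \log 2\pi\sigma^2_{k,d,i} - \frac{(y_{n,d} - \mu_{k,d,i})^2}{2\sigma^2_{k,d,i}} \Big) \Big]$$

where $u_{n,k}$ and $z_{n,k,d,i}$ are the 0/1 indicators of $u_n = k$ and
$z_{n,d} = i$, and $\theta = (A, \rho, \pi, \mu, \sigma)$.

In the E-step, we compute the auxiliary function

$$Q(\theta, \hat{\theta}) = E\big[L(x, u, z, \theta) \mid x, \hat{\theta}\big] = N \log |\det A| + \sum_{n=1}^{N} \sum_{k=1}^{K} w_{n,k} \log \rho_k + \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{d=1}^{D} \sum_{i=1}^{I_{k,d}} w_{n,k,d,i} \Big[ \log \pi_{k,d,i} - \tfrac{1}{2} \log 2\pi\sigma^2_{k,d,i} - \frac{(y_{n,d} - \mu_{k,d,i})^2}{2\sigma^2_{k,d,i}} \Big]$$

where, with the parameters on the right-hand side taken from
the current estimate $\hat{\theta}$,

$$w_{n,k} = E(u_{n,k} \mid x_n, \hat{\theta}) = \frac{\rho_k \prod_{d=1}^{D} \sum_{i=1}^{I_{k,d}} \pi_{k,d,i}\, G(y_{n,d}, \mu_{k,d,i}, \sigma^2_{k,d,i})}{\sum_{k'=1}^{K} \rho_{k'} \prod_{d=1}^{D} \sum_{i=1}^{I_{k',d}} \pi_{k',d,i}\, G(y_{n,d}, \mu_{k',d,i}, \sigma^2_{k',d,i})}$$

$$w_{n,k,d,i} = E(u_{n,k} z_{n,k,d,i} \mid x_n, \hat{\theta}) = w_{n,k}\, \frac{\pi_{k,d,i}\, G(y_{n,d}, \mu_{k,d,i}, \sigma^2_{k,d,i})}{\sum_{i'=1}^{I_{k,d}} \pi_{k,d,i'}\, G(y_{n,d}, \mu_{k,d,i'}, \sigma^2_{k,d,i'})}$$

In the M-step, we maximize the auxiliary function Q to
update the parameter estimation

$$\hat{\theta} = \arg\max_{\theta} Q(\theta, \hat{\theta})$$

We have the following iterative scheme. Note that the
transform matrix A can be initialized as the current estimate
or as an identity matrix.

(1) Update the compound Gaussian model parameters:

$$\rho_k = \frac{\sum_{n=1}^{N} w_{n,k}}{\sum_{k'=1}^{K} \sum_{n=1}^{N} w_{n,k'}}$$

$$\pi_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}}{\sum_{i'=1}^{I_{k,d}} \sum_{n=1}^{N} w_{n,k,d,i'}}$$

$$\mu_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}\, y_{n,d}}{\sum_{n=1}^{N} w_{n,k,d,i}}$$

$$\sigma^2_{k,d,i} = \frac{\sum_{n=1}^{N} w_{n,k,d,i}\, (y_{n,d} - \mu_{k,d,i})^2}{\sum_{n=1}^{N} w_{n,k,d,i}}$$

where the transformed data are obtained using the
current estimate of A: $y_n = A x_n$.

(2) Update the transform matrix A. We will update A row
by row. Let the vector $a_j$ be the j-th row of A, and let
$c_j = (c_{j,1}, \ldots, c_{j,D})$ be the cofactors of A associated with
the j-th row. Note that $a_j$ and $c_j$ are both row vectors. We
plug the updated $(\pi, \mu, \sigma)$ into $Q(\theta, \hat{\theta})$.

[Equations illegible in source.]

Therefore, the j-th row $a_j$ can be updated as

[Equation illegible in source.]

where p can be solved from the quadratic equation (6).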
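Because the row-update equations did not survive extraction, the following is only a sketch of how such an update can be derived from the auxiliary function Q; it is not a reproduction of the patent's equations. The symbols $G_j$, $h_j$ and $\beta$ are notation assumed here for illustration, $x_n$ is treated as a column vector, and the sketch uses the cofactor expansion $\det A = a_j c_j^T$ together with $y_{n,j} = a_j x_n$.

```latex
% Terms of Q that depend on the j-th row a_j (all other rows held fixed):
\[
Q(a_j) = N \log \lvert a_j c_j^{T} \rvert
       - \tfrac{1}{2}\, a_j G_j a_j^{T} + a_j h_j^{T} + \text{const},
\]
\[
G_j = \sum_{k=1}^{K} \sum_{i=1}^{I_{k,j}} \frac{1}{\sigma^{2}_{k,j,i}}
      \sum_{n=1}^{N} w_{n,k,j,i}\, x_n x_n^{T},
\qquad
h_j = \sum_{k=1}^{K} \sum_{i=1}^{I_{k,j}} \frac{\mu_{k,j,i}}{\sigma^{2}_{k,j,i}}
      \sum_{n=1}^{N} w_{n,k,j,i}\, x_n^{T}.
\]
% Setting the gradient with respect to a_j to zero
% (c_j does not depend on a_j):
\[
\frac{N\, c_j}{a_j c_j^{T}} - a_j G_j + h_j = 0
\quad\Longrightarrow\quad
a_j = (\beta\, c_j + h_j)\, G_j^{-1},
\qquad \beta = \frac{N}{a_j c_j^{T}}.
\]
% Substituting a_j back into the definition of beta gives a quadratic:
\[
\big(c_j G_j^{-1} c_j^{T}\big)\, \beta^{2}
+ \big(h_j G_j^{-1} c_j^{T}\big)\, \beta - N = 0 .
\]
```

Since $c_j G_j^{-1} c_j^T > 0$ when $G_j$ is positive definite, this quadratic has two real roots; one would evaluate $Q(a_j)$ at both and keep the root that gives the larger value.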

A description of feature extraction via Gaussianization 
will now be given. In the problem of classification, we are 
often interested in extracting a lower dimensional feature 
from the original feature such that the new feature has the 
lowest dimension possible, but retains as much discrimina- 
tive information among the classes as possible. If we have 
lower dimensional features, we often need less training data
and we can train more robust classifiers.

We now consider feature extraction with the naive Bayesian
classifier. Let $x \in R^D$ be the original feature, and
$c \in \{1, \ldots, L\}$ be the class label. By the Bayes formula



$$p(c \mid x) = \frac{p(x \mid c)\, p(c)}{p(x)}$$

and we classify x as

$$\hat{c} = \arg\max_{c}\, p(x \mid c)\, p(c)$$

where p(x|c) is the probability density of the feature vector
and p(c) is the prior probability for class c. Typically, p(x|c)
can be modeled as a parametric density

$$p(x \mid c, \theta_c)$$

where the parameters $\theta_c$ can be estimated from the training
data. We seek a transform T such that the naive Bayesian
classifier in the transformed feature space y = T(x)
outperforms the naive Bayesian classifier in the original feature
space. Let $\{x_i, c_i : 1 \le i \le N\}$ be the training data.

Prior work has focused on linear transforms. In the
well-known linear discriminant analysis (LDA), we assume
that the class densities are Gaussian with a common
covariance matrix

$$p(y \mid c) = G(y, \mu_c, \Sigma);$$

We estimate the transform matrix T via maximum
likelihood

$$\{\hat{T}, \hat{\mu}, \hat{\Sigma}\} = \arg\max \sum_{i=1}^{N} \log G(T x_i, \mu_{c_i}, \Sigma) \qquad (18)$$
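The decision rule $\hat{c} = \arg\max_c p(x \mid c)\, p(c)$ described above can be illustrated with a small naive Bayesian classifier. This is a sketch under stated assumptions rather than anything taken from the patent: each class density is modeled as a product of univariate Gaussians (one per dimension), the function names are hypothetical, and the variance floor of `1e-6` is an arbitrary safeguard.

```python
import math
from collections import defaultdict


def fit_gaussian_naive_bayes(X, labels):
    """Estimate p(c) and per-dimension Gaussian parameters for each class."""
    by_class = defaultdict(list)
    for x, c in zip(X, labels):
        by_class[c].append(x)
    model = {}
    for c, rows in by_class.items():
        n, dim = len(rows), len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dim)]
        variances = [max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-6)
                     for d in range(dim)]
        model[c] = (n / len(X), means, variances)
    return model


def classify(model, x):
    """Return arg max over c of log p(x|c) + log p(c)."""
    def score(c):
        prior, means, variances = model[c]
        s = math.log(prior)
        for xd, m, v in zip(x, means, variances):
            s += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xd - m) ** 2 / v
        return s
    return max(model, key=score)
```

A feature transform T would be applied to every `x` before both fitting and classification; the classifier itself is unchanged.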
Recently, there have been generalizations of LDA to
heteroscedastic cases, e.g., the Heteroscedastic LDA and the
semi-tied covariance. The Heteroscedastic LDA is described
by N. Kumar, in "Investigation of Silicon-Auditory Models
and Generalization of Linear Discriminant Analysis for
Improved Speech Recognition", Ph.D. dissertation, Johns
Hopkins University, Baltimore, Md., 1997. The semi-tied
covariance is described by: R. A. Gopinath, in "Constrained
Maximum Likelihood Modeling with Gaussian
Distributions", Proc. of DARPA Speech Recognition
Workshop, February 8-11, Lansdowne, Va., 1998; and M. J.
F. Gales, in "Semi-tied Covariance Matrices for Hidden
Markov Models", IEEE Transactions on Speech and Audio
Processing, Vol. 7, pp. 272-81, 1999. These techniques
assume that each class has its own covariance matrix, which
is, however, diagonal. Both the linear transform and the
Gaussian model parameters can be estimated via maximum
likelihood as in (18).

We consider here a particular nonlinear transform and we
optimize the nonlinear transform via maximum likelihood.
Motivation for our nonlinear transform will now be
described. Linear techniques such as Heteroscedastic LDA
and the semi-tied covariance assume diagonal covariance
and attempt to find a linear transform such that the diagonal
(i.e., independent) assumption is more valid. This is
described by R. A. Gopinath, in "Constrained Maximum
Likelihood Modeling with Gaussian Distributions", Proc. of
DARPA Speech Recognition Workshop, February 8-11,
Lansdowne, Va., 1998. We find a transform such that the
Gaussian assumption is more valid, which leads us to
Gaussianization.

We opt to Gaussianize each dimension separately, which
can be followed by, e.g., the semi-tied covariance technique,
to find the best linear projection. Theoretically, we can write
out the log likelihood with respect to both the
Gaussianization parameters and the linear projection parameters.
However, it is extremely difficult to optimize them for a
large dataset, such as in automatic speech recognition.

Without loss of generality, we assume $x \in R^1$. It is
important to choose the proper parameterization of
Gaussianization such that the maximum likelihood optimization
can be carried out efficiently. We parameterize
Gaussianization as a piecewise linear transform. Let

[equation illegible in source] (19)

where $h_m$ are the first-order splines

[definition illegible in source]

Notice that

[left-hand side illegible in source] $= a_{m+1} - a_m$ if $t_m \le x < t_{m+1}$.

We assume univariate Gaussian distribution for the
transformed variable y

$$p(y \mid c) = G(y, \mu_c, \sigma_c^2)$$
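A piecewise linear transform of a scalar can be sketched as follows. Equation 19 is not legible in this text, so the form used here is an assumption: the transform linearly interpolates values `values[m]` placed at knots `knots[m]` (which is what a weighted sum of first-order spline basis functions produces) and extends the first and last segments beyond the outer knots. All function and parameter names are hypothetical.

```python
from bisect import bisect_right


def segment_index(knots, x):
    """Index m of the segment [t_m, t_{m+1}) containing x, clamped to the ends."""
    m = bisect_right(knots, x) - 1
    return min(max(m, 0), len(knots) - 2)


def slope(knots, values, x):
    """Derivative of the transform at x (constant within each segment)."""
    m = segment_index(knots, x)
    return (values[m + 1] - values[m]) / (knots[m + 1] - knots[m])


def transform(knots, values, x):
    """Piecewise linear map through the points (knots[m], values[m])."""
    m = segment_index(knots, x)
    return values[m] + slope(knots, values, x) * (x - knots[m])


def is_strictly_increasing(values):
    """With increasing knots, the transform is strictly increasing iff this holds."""
    return all(b > a for a, b in zip(values, values[1:]))
```

Keeping the knot values strictly increasing makes every segment slope positive, so the transform is invertible and its derivative can be used as a Jacobian term in a likelihood.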
The log likelihood function

[equation illegible in source]

Denote

$$N_l = \sum_{i=1}^{N} I(c_i = l)$$

and

[symbol illegible in source] $= \sum_{n=1}^{N} I(t_m \le x_n < t_{m+1})$

Clearly,

[equations illegible in source]

where

[definition illegible in source]

Denote

$a = (a_0 \cdots a_M)^T$
[second vector illegible in source]

We have

[equation illegible in source]

Therefore, we can rewrite the log likelihood as

[equation illegible in source]
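Since the equations of this derivation are not legible, the following sketch shows the general shape such a log likelihood takes under the stated model; it is a reconstruction under assumptions, not the patent's own formula. It assumes a strictly increasing piecewise linear transform $y = T(x)$ with constant slope $s_m$ on $[t_m, t_{m+1})$ and class densities $p(y \mid c) = G(y, \mu_c, \sigma_c^2)$; the symbols $s_m$ and $n_m$ are introduced here for illustration.

```latex
% Change of variables for a strictly increasing transform y = T(x):
\[
p(x \mid c) = G\big(T(x), \mu_c, \sigma_c^{2}\big)\, T'(x).
\]
% Log likelihood of the training data {x_i, c_i}, with y_i = T(x_i):
\[
L = \sum_{i=1}^{N} \Big[ \log G\big(y_i, \mu_{c_i}, \sigma_{c_i}^{2}\big)
    + \log T'(x_i) \Big].
\]
% Grouping by class l and by interval m, with
% N_l = \sum_i I(c_i = l) and n_m = \sum_n I(t_m \le x_n < t_{m+1}):
\[
L = - \sum_{l=1}^{L} \frac{N_l}{2} \log\big(2\pi\sigma_l^{2}\big)
    - \sum_{l=1}^{L} \frac{1}{2\sigma_l^{2}}
      \sum_{i=1}^{N} I(c_i = l)\, (y_i - \mu_l)^{2}
    + \sum_{m} n_m \log s_m .
\]
```

The last term is where the per-interval counts enter: because $T'$ is constant on each interval, the Jacobian contribution depends on the data only through how many points fall in each interval.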
A description of various implementations of the methods
of the present invention will now be given, along with test 
results of the iterative Gaussianization method of the inven- 
tion. 

It is to be understood that the present invention may be 
implemented in various forms of hardware, software, 
firmware, special purpose processors, or a combination 
thereof. Preferably, the present invention is implemented in 
software as an application program tangibly embodied on a 
program storage device. The application program may be 
uploaded to, and executed by, a machine comprising any 
suitable architecture. Preferably, the machine is imple- 
mented on a computer platform having hardware such as one 
or more central processing units (CPU), a random access 
memory (RAM), and input/output (I/O) interface(s). The 
computer platform also includes an operating system and 
micro instruction code. The various processes and functions 
described herein may either be part of the micro instruction 
code or part of the application program (or a combination 
thereof) which is executed via the operating system. In 
addition, various other peripheral devices may be connected 
to the computer platform such as an additional data storage 
device and a printing device. 

It is to be further understood that, because some of the 
constituent system components and method steps depicted 
in the accompanying Figures are preferably implemented in 
software, the actual connections between the system com- 
ponents (or the process steps) may differ depending upon the 
manner in which the present invention is programmed. 
Given the teachings of the present invention provided 
herein, one of ordinary skill in the related art will be able to 
contemplate these and similar implementations or configu- 
rations of the present invention. 

FIG. 6 is a block diagram of a computer processing 
system 600 to which the present invention may be applied 
according to an embodiment thereof. The computer process- 
ing system includes at least one processor (CPU) 602 
operatively coupled to other components via a system bus 
604. A read-only memory (ROM) 606, a random access 
memory (RAM) 608, a display adapter 610, an I/O adapter 
612, and a user interface adapter 614 are operatively coupled 
to the system bus 604 by the I/O adapter 612.

A mouse 620 and keyboard 622 are operatively coupled to 
the system bus 604 by the user interface adapter 614. The 
mouse 620 and keyboard 622 may be used to input/output 
information to/from the computer processing system 600. It 
is to be appreciated that other configurations of computer 
processing system 600 may be employed in accordance with 
the present invention while maintaining the spirit and the 
scope thereof. 

The iterative Gaussianization method of the invention, 
both as parametric projection pursuit and as hierarchical 
independent pursuit, can be effective in high dimensional 
structure mining and high dimensional data visualization. 
Also, most of the methods of the invention can be directly 
applied to automatic speech and speaker recognition. For 
example, the standard Gaussian mixture density models can 
be replaced by mixtures of compound Gaussians (with or 
without linear transforms). 

We constrain

[constraint illegible in source]

i.e., we assume that the Gaussianization transform (as per
equation 19) is strictly increasing. We can solve this using a
standard numerical optimization package.

TABLE 1

                     Logarithm    1-d Gaussianization
Word Error Rate      18.5%        18.1%



The nonlinear feature extraction method of the present
invention can be applied to the front end of a speech
recognition system. Standard mel-cepstral computation
involves taking the logarithm of the Mel-binned Fourier
spectrum. According to an embodiment of the invention, the
logarithm is replaced by univariate Gaussianization along
each dimension. The univariate Gaussianization is estimated
by pooling the entire training data for all the classes. Table
1 illustrates the results of using this technique on the 1997
DARPA HUB4 Broadcast News Transcription Evaluation
data using an HMM system with 135K Gaussians. As
shown, modest improvements (0.4%) are obtained. One
would expect that the nonlinear feature extraction algorithm
provided above, which is optimized for multiple populations,
would perform better.
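Replacing the logarithm with a univariate Gaussianization can be sketched for a single dimension as follows. This is an illustration under assumptions, not the patent's front end: it assumes the pooled data in that dimension has already been modeled by a univariate Gaussian mixture with the given priors, means and standard deviations, and it maps a value through the mixture CDF followed by the inverse standard normal CDF. The function names and the clipping constant `eps` are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()


def mixture_cdf(x, priors, means, stdevs):
    """CDF of a univariate Gaussian mixture evaluated at x."""
    return sum(p * NormalDist(m, s).cdf(x)
               for p, m, s in zip(priors, means, stdevs))


def gaussianize(x, priors, means, stdevs, eps=1e-12):
    """Map x so that mixture-distributed inputs become standard normal."""
    u = mixture_cdf(x, priors, means, stdevs)
    u = min(max(u, eps), 1.0 - eps)  # inv_cdf requires 0 < u < 1
    return _STANDARD_NORMAL.inv_cdf(u)
```

In a front end, one such transform per Mel bin would be applied where the logarithm is normally taken. When the mixture has a single component, the transform reduces to ordinary mean and variance normalization.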

Although the illustrative embodiments have been 
described herein with reference to the accompanying 
drawings, it is to be understood that the present invention is 
not limited to those precise embodiments, and that various 
other changes may be effected therein by one of ordinary
skill in the art without departing from the scope or spirit of 
the invention. All such changes and modifications are 
intended to be included within the scope of the invention as 
defined by the appended claims. 

What is claimed is: 

1. A method for generating a high dimensional density 
model within an acoustic model for one of a speech and a 
speaker recognition system, the density model having a 
plurality of components, each component having a plurality 
of coordinates corresponding to a feature space, the method
comprising the steps of:

transforming acoustic data obtained from at least one 
speaker into high dimensional feature vectors; 

forming the density model to model the feature vectors by 
a mixture of compound Gaussians with a linear 
transform, wherein each compound Gaussian is asso- 
ciated with a compound Gaussian prior and models 
each of the coordinates of each of the components of 
the density model independently by a univariate Gaus- 
sian mixture comprising a univariate Gaussian prior, 
variance, and mean. 

2. The method according to claim 1, further comprising 
the step of applying an iterative expectation maximization 
(EM) method to the feature vectors, to estimate the linear 
transform, the compound Gaussian priors, and the univariate 
Gaussian priors, variances, and means. 

3. The method according to claim 2, wherein the EM 
method comprises the steps of: 

computing an auxiliary function Q of the EM method; 

respectively updating the compound Gaussian priors and 
the univariate Gaussian priors, to maximize the auxil- 
iary function Q; 

respectively updating the univariate Gaussian variances, 
the linear transform row by row, and the univariate 
Gaussian means, to maximize the auxiliary function Q; 

repeating said second updating step, until the auxiliary 
function Q converges to a local maximum; 

repeating said computing step and said second updating 
step, until a log likelihood of the feature vectors con- 
verges to a local maximum. 

4. The method according to claim 3, further comprising 
the step of updating the density model to model the feature 
vectors by the mixture of compound Gaussians with the 
updated linear transform, wherein each of the compound 
Gaussians is associated with one of the updated compound 
Gaussian priors and models each of the coordinates of each 
of the components independently by the univariate Gaussian 
mixtures comprising the updated univariate Gaussian priors, 
variances, and means.

5. The method according to claim 3, wherein the linear
transform is fixed, when the univariate Gaussian variances 
are updated. 

6. The method according to claim 3, wherein the univari- 
ate Gaussian variances are fixed, when the linear transform 
is updated. 

7. The method according to claim 3, wherein the linear 
transform is fixed, when the univariate Gaussian means are 
updated. 

8. A method for generating a high dimensional density 
model within an acoustic model for one of a speech and a 
speaker recognition system, comprising the steps of: 

transforming acoustic data obtained from at least one
speaker into high dimensional feature vectors; 

forming the density model to model the feature vectors by 
a mixture of compound Gaussians with a linear 
transform, wherein each compound Gaussian is asso- 
ciated with a compound Gaussian prior and models 
each coordinate of each component of the density 
model independently by a univariate Gaussian mixture 
comprising a univariate Gaussian prior, variance, and 
mean; 

applying an iterative expectation maximization (EM) 
method to the feature vectors, comprising the steps of: 

computing an auxiliary function Q of the EM method; 

respectively updating the compound Gaussian priors and 
the univariate Gaussian priors, to maximize the func- 
tion Q; 

respectively updating the univariate Gaussian variances, 
the linear transform row by row, and the univariate 
Gaussian means, to maximize the function Q; 

repeating said second updating step, until the auxiliary 
function Q converges to a local maximum; 

repeating said computing step and said second updating 
step, until a log likelihood of the feature vectors con- 
verges to a local maximum; and 

updating the density model to model the feature vectors 
by the mixture of compound Gaussians with the 
updated linear transform, wherein each compound 
Gaussian is associated with one of the updated com- 
pound Gaussian priors and models each of the coordi- 
nates of each of the components independently by 
univariate Gaussian mixtures comprising the updated 
univariate Gaussian priors, variances, and means. 

9. The method according to claim 8, wherein the linear 
transform is fixed, when the univariate Gaussian variances 
are updated. 

10. The method according to claim 8, wherein the univari- 
ate Gaussian variances are fixed, when the linear transform 
is updated. 

11. The method according to claim 8, wherein the linear 
transform is fixed, when the univariate Gaussian means are 
updated. 

12. The method according to claim 8, further comprising 
the step of determining the log likelihood of the high 
dimensional acoustic data, prior to said applying step. 

13. The method according to claim 12, wherein the 
auxiliary function Q is computed based upon the log like- 
lihood of the feature vectors. 

14. A program storage device readable by machine, tan- 
gibly embodying a program of instructions executable by the 
machine to perform method steps for generating a high 
dimensional density model within an acoustic model for one 
of a speech and a speaker recognition system, said method 
steps comprising: 

forming the density model to model feature vectors 
obtained from at least one speaker by a mixture of
compound Gaussians with a linear transform, wherein
each compound Gaussian is associated with a compound
Gaussian prior and models each coordinate of
each component of the density model independently by
a univariate Gaussian mixture comprising a univariate
Gaussian prior, variance, and mean;

applying an iterative expectation maximization (EM) 
method to the feature vectors, comprising the steps of: 

computing an auxiliary function Q of the EM method; 

respectively updating the compound Gaussian priors and 
the univariate Gaussian priors, to maximize the func- 
tion Q; 

respectively updating the univariate Gaussian variances, 
the linear transform row by row, and the univariate
Gaussian means, to maximize the function Q; 

repeating said second updating step, until the auxiliary 
function Q converges to a local maximum; 

repeating said computing step and said second updating 
step, until a log likelihood of the feature vectors converges
to a local maximum; and

updating the density model to model the feature vectors 
by the mixture of compound Gaussians with the 
updated linear transform, wherein each compound
Gaussian is associated with one of the updated com- 
pound Gaussian priors and models each of the coordi- 
nates of each of the components independently by 
univariate Gaussian mixtures comprising the updated 
univariate Gaussian priors, variances, and means. 

15. The program storage device according to claim 14, 
wherein the linear transform is fixed, when the univariate 
Gaussian variances are updated. 

16. The program storage device according to claim 14, 
wherein the univariate Gaussian variances are fixed, when 
the linear transform is updated. 

17. The program storage device according to claim 14, 
wherein the linear transform is fixed, when the univariate 
Gaussian means are updated. 

18. The program storage device according to claim 14, 
further comprising the step of determining the log likelihood 
of the high dimensional acoustic data, prior to said applying 
step. 

19. The program storage device according to claim 18, 
wherein the auxiliary function Q is computed based upon the 
log likelihood of the feature vectors. 

* * * * *